Speaking to the iPhone (pfft 4S)?
So the iPhone 5 is almost upon us and we've been hearing a lot of rumors around Apple's "Assistant": the technology they acquired, and have almost certainly developed, from Siri.
It's time again to test ourselves against Apple's legendary product design and to be honest enough to put a post out for posterity to test our predictions. Here are mine.
This could be a watershed for me- let me introduce a fundamental belief of mine. Speech and physical interaction are primary in humans, symbolic manipulation like reading, writing and math secondary. Children spontaneously learn to navigate the physical world and manipulate the objects within it but we have to spend years of training on reading, writing and a'withmetic.
Apple will introduce a phone with an integrated voice-based command and control system that will reach modestly into a few applications on the phone. You will also be able to conduct searches across some known domains: music, movies and local restaurants perhaps. It will be similar to Mango and move things on a little. A new API will be offered and a few example applications trotted out. I doubt that it will make the grade of "mainstreaming artificial intelligence", being more of an increment from current speech interfaces. What is remarkable is that both Microsoft and Apple are bringing such interfaces to the mass market.
Let's look at three types of speech system: Flowchart, Slot & Filler, Statistical.
Flow charts
These are the really annoying ones. Flow chart systems are the type that a typical GUI developer would understand. For every interaction there is a predetermined and preprogrammed path that the software will take. Although there are a number of templates (a few for each service) the software is taking fixed paths through the decision graph.
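To make the fixed-path idea concrete, here is a minimal sketch of a flow chart system. The graph, the node names and the run_flow helper are all made up for illustration; the point is only that every accepted reply is wired to exactly one next step, and anything off-script goes nowhere.

```python
# Hypothetical decision graph: each node has a prompt and a fixed set of
# accepted replies, each leading to exactly one next node.
FLOW = {
    "start": {"prompt": "Say 'music' or 'call'.",
              "next": {"music": "genre", "call": "contact"}},
    "genre": {"prompt": "Say 'rock' or 'jazz'.",
              "next": {"rock": "play_rock", "jazz": "play_jazz"}},
    "contact": {"prompt": "Say 'home' or 'work'.",
                "next": {"home": "dial_home", "work": "dial_work"}},
}


def run_flow(utterances, node="start"):
    """Walk the graph with a list of user utterances; return the visited nodes."""
    path = [node]
    for said in utterances:
        if node not in FLOW:  # reached a leaf action such as "play_rock"
            break
        options = FLOW[node]["next"]
        # Anything off-script leaves us on the same node (a re-prompt).
        node = options.get(said.strip().lower(), node)
        path.append(node)
    return path
```

Feeding it ["music", "rock"] walks straight to play_rock, while a natural request such as "play some kings of leon" just leaves the caller stuck on the first prompt, which is exactly why these systems annoy.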
Slot & Filler
Here the system has a goal for which it needs to collect a bunch of information. You might say, "I'd like to travel to Boston tomorrow at 3pm". And the system would ask, "OK, where are you leaving from?". "Oh, New York", you say. This system has considerably more flexibility than the flow chart based one, but it's still pretty much a fixed template; the difference is simply that the user can under- or oversupply information up front.
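The travel exchange above can be sketched in a few lines. This is a toy under stated assumptions: the slot names, the three-city vocabulary and the regular expressions are invented for the example, and a real system would sit behind a speech recogniser rather than plain text.

```python
import re

# Hypothetical slots for the travel example, with the question for each.
SLOTS = {
    "origin": "OK, where are you leaving from?",
    "destination": "Where would you like to go?",
    "day": "Which day?",
    "time": "What time?",
}
CITIES = ["New York", "Boston", "Chicago"]  # toy vocabulary


def fill_slots(utterance, frame, asked=None):
    """Update the frame with anything recognisable in the utterance."""
    text = utterance.lower()
    for city in CITIES:
        c = re.escape(city.lower())
        if re.search(r"\bto " + c, text):
            frame["destination"] = city
        elif re.search(r"\bfrom " + c, text):
            frame["origin"] = city
        elif re.search(c, text) and asked in ("origin", "destination"):
            frame[asked] = city  # a bare city answers the open question
    day = re.search(r"\b(today|tomorrow)\b", text)
    if day:
        frame["day"] = day.group(1)
    when = re.search(r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm))\b", text)
    if when:
        frame["time"] = when.group(1)
    return frame


def next_question(frame):
    """Return (slot, prompt) for the first empty slot, or None when complete."""
    for slot, prompt in SLOTS.items():
        if slot not in frame:
            return slot, prompt
    return None
```

The first utterance fills three slots at once, so the only question left is the origin; the template is still fixed, but the order in which the user supplies things is not.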
I expect Apple to introduce a system like this. It will be hand-woven into the GUI by developers. There will be some key places where one can interact with speech, depicted by a microphone button. It's also possible there will be a central button that may be pressed to gain access to dialogues in other apps, "Listen to music".
Statistical dialog systems
These are the research systems of today. Rather than having some fixed path or paths with a number of slots to fill in, these systems are trained on hundreds of thousands of typical dialogs. Instead of looking for specific patterns of words they will progress on a fuzzier, "here's what I think you said" basis. They have a notion of a distant goal (within each domain) and can progress to that goal, earning a number of rewards on the way. Maximizing this reward (a proxy for pleasing the user) is their game. They have a notion of confidence in what they thought the user meant, and can progress through a number of means: saying something, assuming something, showing the user something (like a prompt, dialog, or other UI element, perhaps a map).
User: "I wanna listen to some music"
Phone: "OK, what genre", <show example genres on screen>
User: "Nah, play that album. You know, the Kings of Leon"
Phone: "Come Around Sundown, or Only by the Night?"
User: "Er, the first one"
Phone: <show track listing, play first track>
Despite the swagger of the user, all the extra words and the change of tack, the system still arrives at a desirable outcome.
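A rough sketch of the confidence idea, using the album exchange above. Everything here is an assumption made for illustration: the scores are invented, the 0.05 floor for unscored goals is arbitrary, and the thresholds are picked by hand, whereas a real statistical system would learn its policy from dialog data rather than use fixed cut-offs.

```python
def update_belief(belief, observation):
    """Weight each candidate goal by the recogniser's score, then renormalise.

    Goals the recogniser did not score get a small floor (0.05, arbitrary).
    """
    posterior = {goal: p * observation.get(goal, 0.05) for goal, p in belief.items()}
    total = sum(posterior.values())
    return {goal: p / total for goal, p in posterior.items()}


def choose_action(belief, act_threshold=0.8, show_threshold=0.6):
    """Pick how to progress: assume, show the guess on screen, or ask again."""
    goal, confidence = max(belief.items(), key=lambda kv: kv[1])
    if confidence >= act_threshold:
        return ("assume", goal)  # confident enough to just play it
    if confidence >= show_threshold:
        return ("show", goal)    # put the guess on screen for confirmation
    return ("ask", None)         # too unsure: ask a clarifying question


# Two candidate albums, equally likely after "play that album ... Kings of Leon".
belief = {"Come Around Sundown": 0.5, "Only by the Night": 0.5}
```

With an even split the system asks; after "Er, the first one" scores the first album at, say, 0.9 against 0.1, the belief tips far enough that it simply plays it. A murkier reply lands in the middle band and gets a prompt on screen instead.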
I don't expect Apple to do this.
An aside: One curious thing about Siri. In one of their pitches I saw them say their major innovation was a way of connecting to lots of services, lots of APIs. That kind of treats the speech UI aspect as throwaway and makes me wonder what Apple actually got.
An API
Apple has, within Mac OS X, an indication of how the speech services could work. Both the Services menu and the Scripting API require that an app expose a mini-language of the things it is capable of. Each app could publish some templates for dialogues it can participate in. That would be the advanced option. More likely is an API where one has to drive the dialogue with a number of basic functions such as speak(<prompt>), listen(<limited grammar>) and so on.
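Here is what driving a dialogue by hand with speak and listen might look like. To be clear, this is not a real Apple API: the ScriptedSpeech class is a stand-in with canned replies so the sketch runs anywhere, and the check-in dialogue and venue list are invented.

```python
class ScriptedSpeech:
    """Stand-in for a hypothetical platform speech service (not a real Apple API)."""

    def __init__(self, scripted_replies):
        self.replies = list(scripted_replies)  # what the "user" will say, in order
        self.spoken = []                       # everything the phone has said

    def speak(self, prompt):
        self.spoken.append(prompt)

    def listen(self, grammar):
        """Return the first grammar phrase found in the next reply, else None."""
        if not self.replies:
            return None
        heard = self.replies.pop(0).lower()
        for phrase in grammar:
            if phrase.lower() in heard:
                return phrase
        return None


def check_in_dialogue(speech, venues, max_tries=2):
    """The app drives every turn itself: prompt, listen, retry or give up."""
    for _ in range(max_tries):
        speech.speak("Where do you want to check in?")
        venue = speech.listen(venues)
        if venue is not None:
            speech.speak("Checked in at " + venue + ".")
            return venue
        speech.speak("Sorry, I didn't catch that.")
    return None
```

Note how much of the conversation logic ends up in the app: the retry loop, the apology, the limited grammar. That is the cost of the basic-functions style compared with publishing dialogue templates.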
Challenges of speech: Acoustics, Language & Dialogue
Until now, I've talked about the dialogue aspect of speech systems: the structure and flow of the whole conversation. These systems are a considerable distance from the types of complex spoken interactions humans have. For the most part they are crude command and control systems, or a souped-up GUI control with speech input (a la Android search). Let's have a quick look at some of the other aspects of speech interfaces.
Often when talking about speech people fixate on accent or pronunciation as a barrier to computer understanding. While it's true that acoustic modelling research has had to overcome this barrier, companies like Nuance and Google now have enough data to be able to deal with accents. More difficult for all systems is general noise: background noise, music, car noise, crowds, etc.
Beyond that is a question of word choice and dialect. Template systems will expect to find a small set of particular forms of words. A statistical system should be less finickity, providing the keywords needed are amongst the speech.
Finally there is the question of meaning. It doesn't really matter what we say, or indeed whether we speak at all, providing we are reliably understood. Pointing will suffice in some cases. The statistical systems come closest to this nirvana but even today's best systems are confined to limited known domains within tightly defined applications.
What's missing?
The GUI gives you an idea as to how you might interact with it. Not so with speech. This is more like the bad old days of the learn, remember and type command line interface. If you look at Mango you'll see that they are prompting the user with phrases on the screen: "You can say…call…delete…", etc.
That is to say you can't learn about the speech features through your phone's speech interface: this is not a conversation. At best it is a shortcut, or a handy trick for a disadvantaged scenario: trying to text whilst driving for instance.
Say I'm buying a camera in a shop. I might expect the seller to spend some time educating me about the particular cameras available, the things I can do with them and their specific features. Sometimes the seller leads, sometimes I do by asking questions, or making statements. The conversation ebbs and flows between us. Computers are clearly far from this, but consider something simpler like ordering a cup of coffee. In fact, think of all the ways that a coffee was ordered around the world today. Now you have some sense of the complexity of even this micro interaction. It is one of many hundreds, perhaps thousands, during our day.
So what will we see tomorrow?
A system-wide speech command and control system that will extend to key apps and a few external apps. It will have templated dialogues for known domains and a basic API for developers to apply speech facilities to their own apps. There may be a limited attempt at integrating an application's capabilities, "I want to check in" [to 4-square], into the system. These will be accessed with a common button, perhaps a long press on home.
It will have a wearing and mechanical feel- frustrating if the recognition goes awry and with no ability to recover broken dialogues. A good trick, and one that gets usage, but ultimately has a feeling of being half-done. It will be an important milestone along the road of talking usefully with computers, but fall short of the techniques we will use and the interfaces we will need for talking to computers in general.
I'll stick my neck out and say that I think the personal assistant thing will be a flub, like Ping. The problem is not the idea, which is a fine one, but that the implementation is too brittle and has many integration difficulties. I think Norman Winarsky has been overstating its capabilities, and he's been out of the product loop for a while now. You can learn about some of their thinking here.
On disintermediating Google. Apple has shown little capability in the mass data sphere (social, advertising), preferring scaled versions of curatorship like the app store approvals process. I expect them to continue this: speech on the iPhone will give you access to a small number of services rather than the web in general. Though they do have job adverts asking for, "Demonstrated experience with Nuance Recognizer, IBM WebSphere Voice, Google Voice, or similar voice search tool". That last bit is a bit mixed up, but let's assume they mean Google Voice Search. Perhaps in a future version.