Speaking to the iPhone (pfft 4S)?

So the iPhone 5 is almost upon us and we've been hearing a lot of rumors around Apple's "Assistant"- the technology they acquired and have almost certainly developed from Siri.

It's time again to test ourselves against Apple's legendary product design and to be honest enough to put a post out for posterity to test our predictions. Here are mine.

This could be a watershed for me- let me introduce a fundamental belief of mine. Speech and physical interaction are primary in humans, symbolic manipulation like reading, writing and math secondary. Children spontaneously learn to navigate the physical world and manipulate the objects within it but we have to spend years of training on reading, writing and a'withmetic.

Apple will introduce a phone with an integrated voice based command and control system that will reach modestly into a few applications on the phone. You will also be able to conduct searches across some known domains: music, movies and local restaurants perhaps. It will be similar to Mango and move things on a little. A new API will be offered and a few example applications trotted out. I doubt that it will make the grade of "mainstreaming artificial intelligence", being more of an increment from current speech interfaces. What is remarkable is both Microsoft and Apple bringing such interfaces to the mass market.

Let's look at three types of speech system: Flowchart, Slot & Filler, Statistical.

Flow charts

These are the really annoying ones. Flow chart systems are the type that a typical GUI developer would understand. For every interaction there is a predetermined and preprogrammed path that the software will take. Although there are a number of templates (a few for each service) the software is taking fixed paths through the decision graph.
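As a minimal sketch of what "fixed paths through the decision graph" means, assuming a made-up graph (the node names, prompts and actions below are all hypothetical and not taken from any real product):

```python
# Hypothetical flow-chart dialogue: every node has one prompt and a fixed
# set of accepted answers, each leading to exactly one next node.
FLOW = {
    "start": ("Say 'music' or 'call'.", {"music": "genre", "call": "who"}),
    "genre": ("Say 'rock' or 'jazz'.", {"rock": "play_rock", "jazz": "play_jazz"}),
    "who": ("Say 'home' or 'work'.", {"home": "dial_home", "work": "dial_work"}),
}


def run_flow(answers, start="start"):
    """Walk the fixed graph; return the terminal action, or None if stuck."""
    node = start
    for answer in answers:
        if node not in FLOW:
            break  # already at a terminal action
        _prompt, branches = FLOW[node]
        if answer not in branches:
            return None  # off-script input: the flow chart has no path for it
        node = branches[answer]
    # Only names that are not keys of FLOW count as finished actions.
    return node if node not in FLOW else None


print(run_flow(["music", "rock"]))         # follows the scripted path
print(run_flow(["play some rock music"]))  # natural phrasing falls off the chart
```

Anything the designer did not draw an arrow for simply fails, which is where the annoyance comes from.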

Slot & Filler

Here the system has a goal for which it needs to collect a bunch of information. You might say, "I'd like to travel to Boston tomorrow at 3pm". And the system would ask, "OK, where are you leaving from?". "Oh, New York", you say. This system has considerably more flexibility than the flow chart based one, but it's still pretty much a fixed template, simply that the user can under- or oversupply information up front.
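The travel example can be sketched as a template of slots plus a loop that asks only for what is still missing. This is a toy under stated assumptions: the slot names, the regular expressions and the prompts are all invented here, and the patterns stand in for a real recogniser and grammar.

```python
import re

# Hypothetical slots for the travel example, in the order they are asked for.
SLOT_PATTERNS = {
    "destination": re.compile(r"\bto ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    "origin": re.compile(r"\bfrom ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    "day": re.compile(r"\b(today|tomorrow)\b"),
    "time": re.compile(r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm))\b"),
}
PROMPTS = {
    "destination": "Where would you like to go?",
    "origin": "OK, where are you leaving from?",
    "day": "Which day?",
    "time": "What time?",
}


def fill(slots, utterance, asked=None):
    """Fill every empty slot the utterance happens to mention."""
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match and name not in slots:
            slots[name] = match.group(1)
    # A bare answer to a direct question fills the slot that was asked for.
    if asked and asked not in slots:
        slots[asked] = utterance.strip()
    return slots


def next_prompt(slots):
    """Return (slot name, prompt) for the first empty slot, or (None, None)."""
    for name in SLOT_PATTERNS:
        if name not in slots:
            return name, PROMPTS[name]
    return None, None


slots = fill({}, "I'd like to travel to Boston tomorrow at 3pm")
asked, prompt = next_prompt(slots)  # only the origin is still missing
print(prompt)
slots = fill(slots, "New York", asked=asked)
print(slots)
```

The user oversupplied three slots up front and the system skipped straight to the fourth, but the template itself never changes.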

I expect Apple to introduce a system like this. It will be hand-woven into the GUI by developers. There will be some key places where one can interact with speech, depicted by a microphone button. It's also possible there will be a central button that may be pressed to gain access to dialogues in other apps, "Listen to music".

Statistical dialog systems

These are the research systems of today. Rather than having some fixed path or paths with a number of slots to fill in, these systems are trained on hundreds of thousands of typical dialogs. Instead of looking for specific patterns of words they will progress on a fuzzier, "here's what I think you said" basis. They have a notion of a distant goal (within each domain) and can progress to that goal earning a number of rewards on the way. Maximizing this reward (a proxy for pleasing the user) is their game. They have a notion of confidence in what they thought the user meant, and can progress through a number of means: saying something, assuming something, showing the user something (like a prompt, dialog, or other UI element- perhaps a map).

User: "I wanna listen to some music"

Phone: "OK, what genre", <show example genres on screen>

User: "Nah, play that album. You know, the Kings of Leon"

Phone: "Come Around Sundown, or Only by the Night?"

User: "Er, the first one"

Phone: <show track listing, play first track>

Despite the swagger of the user, all the extra words and the change of tack, the system still arrives at a desirable outcome.
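To make the role of confidence concrete, here is a deliberately crude sketch. A real statistical system would learn its policy from data and rewards; the hand-picked thresholds, the function name and the confidence numbers below are all assumptions for illustration only.

```python
def choose_action(hypotheses, act_above=0.8, confirm_above=0.4):
    """Pick a dialogue move from (meaning, confidence) pairs.

    The thresholds are hand-picked stand-ins for what a trained
    policy would learn; confidences are assumed to lie in [0, 1].
    """
    if not hypotheses:
        return ("ask", None)
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    best, confidence = ranked[0]
    if confidence >= act_above:
        return ("act", best)  # sure enough: just do it
    if confidence >= confirm_above:
        # Not sure: offer the plausible candidates, by voice or on screen.
        options = [meaning for meaning, c in ranked if c >= confirm_above]
        return ("offer", options)
    return ("ask", None)  # too unsure: ask an open question


# "Nah, play that album. You know, the Kings of Leon"
print(choose_action([("Come Around Sundown", 0.45), ("Only by the Night", 0.45)]))
# "Er, the first one"
print(choose_action([("Come Around Sundown", 0.9)]))
```

The point is only that the system carries several guesses with weights attached and chooses between acting, offering and asking, rather than matching one exact phrase.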

I don't expect Apple to do this.

An aside: One curious thing about Siri. In one of their pitches I saw them say their major innovation was a way of connecting to lots of services, lots of APIs. That kind of treats the speech UI aspect as throwaway and makes me wonder what Apple actually got.

An API

Apple has, within Mac OS X, an indication of how the speech services could work. Both the Services menu and the Scripting API require that an app expose a mini-language of the things it is capable of. Each app could publish some templates for dialogues it can participate in. This is advanced; more likely is an API where one has to drive the dialogue with a number of basic functions such as speak(<prompt>), listen(<limited grammar>) and so on.
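A guess at what driving a dialogue with such basic functions might feel like. Nothing here is a real Apple API: speak and listen are hypothetical callables with invented signatures, and the fake I/O helper exists only so the sketch can run without a microphone.

```python
def pick_music(speak, listen):
    """App-driven dialogue: the developer scripts every turn by hand."""
    speak("What would you like to hear?")
    choice = listen({"rock", "jazz", "classical"})  # limited grammar
    if choice is None:
        speak("Sorry, I didn't catch that.")
        return None
    speak(f"Playing {choice}.")
    return choice


def make_fake_io(heard):
    """Stand-ins for the hypothetical speech calls, fed from a list."""
    spoken = []
    replies = iter(heard)

    def speak(prompt):
        spoken.append(prompt)

    def listen(grammar):
        reply = next(replies, None)
        # Anything outside the limited grammar counts as not recognised.
        return reply if reply in grammar else None

    return speak, listen, spoken


speak, listen, spoken = make_fake_io(["jazz"])
print(pick_music(speak, listen))
print(spoken)
```

With an API at this level the dialogue logic lives in each app, so every developer ends up rebuilding their own little flow chart or slot filler.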

Challenges of speech: Acoustics, Language & Dialogue

Until now, I've talked about the dialogue aspect of speech systems- the structure and flow of the whole conversation. These systems are a considerable distance from the types of complex spoken interactions humans have. For the most part they are crude command and control systems, or a souped-up GUI control with speech input (a la Android search). Let's have a quick look at some of the other aspects of speech interfaces.

Often when talking about speech people fixate on accent or pronunciation as a barrier for computer understanding. While it's true that acoustic modelling research has had to overcome this boundary, companies like Nuance and Google now have enough data to be able to deal with accents. More difficult for all systems is general noise: background noise, music, car noise, crowds, etc.

Beyond that is a question of word choice and dialect. Template systems will expect to find a small set of particular forms of words. A statistical system should be less finickity providing the keywords needed are amongst the speech.

Finally there is the question of meaning. It doesn't really matter what we say, or indeed whether we speak at all, providing we are reliably understood. Pointing will suffice in some cases. The statistical systems come closest to this nirvana but even today's best systems are confined to limited known domains within tightly defined applications.

What's missing?

The GUI gives you an idea as to how you might interact with it. Not so with speech. This is more like the bad old days of the learn - remember - and type command line interface. If you look at Mango you'll see that they are prompting the user with phrases on the screen- "You can say…call…delete…", etc.

That is to say you can't learn about the speech features through your phone's speech interface: this is not a conversation. At best it is a shortcut, or a handy trick for a disadvantaged scenario: trying to text whilst driving for instance.

Say I'm buying a camera in a shop. I might expect the seller to spend some time educating me about the particular cameras available, the things I can do with them and their specific features. Sometimes the seller leads, sometimes I do by asking questions, or making statements. The conversation ebbs and flows between us. Computers are clearly far from this, but consider something simpler like ordering a cup of coffee. In fact think of all the ways that a coffee was ordered around the world today. Now you have some sense of the complexity of even this micro interaction. One of many hundreds, perhaps thousands during our day.

So what will we see tomorrow?

A system wide speech command and control system that will extend to key apps and a few external apps. It will have templated dialogues for known domains and a basic API for developers to apply speech facilities to their own apps. There may be a limited attempt at integrating an application's capabilities, "I want to check in" [to 4-square] into the system. These will be accessed with a common button- perhaps a long press on home.

It will have a wearing and mechanical feel- frustrating if the recognition goes awry and with no ability to recover broken dialogues. A good trick, and one that gets usage, but ultimately has a feeling of being half-done. It will be an important milestone along the road of talking usefully with computers, but fall short of the techniques we will use and the interfaces we will need for talking to computers in general.

I'll stick my neck out and say that I think the personal assistant thing will be a flub, like Ping. It's not the idea which is the problem- it's a fine idea- but that the implementation is too brittle and has many integration difficulties. I think Norman Winarsky has been overstating its capabilities, and he's been out of the product loop for a while now. You can learn about some of their thinking here.

On disintermediating Google. Apple has shown little capability in the mass data sphere (social, advertising) preferring scaled versions of curatorship like the app store approvals process. I expect them to continue this: speech on the iPhone will give you access to a small number of services rather than the web in general. Though they do have jobs advertising for, "Demonstrated experience with Nuance Recognizer, IBM WebSphere Voice, Google Voice, or similar voice search tool". That last bit is a bit mixed up, but let's assume they mean Google Voice Search. Perhaps in a future version.

 

20 years of loading spinner

I hate waiting.

In terms of interaction design the web doesn't do anything. It just puts your data on the other side of the planet, certainly further than the other side of the room.

We used to wait for CPUs and memory and hard disks. Now we wait for the network too. And so, as with HTML5, we wait to go back to the future (1985).

Of course it's a trade-off. Each time you do a Google search you invoke massive power, more than you could ever afford, the geeks will point out.*

But that's missing the point. The remote UI will rarely be faster than a local UI. Check yourself - you're using a local text editor**, local web page renderer, local file system GUI. In fact everything you interact with is really local. Remote sucks (use ssh or a more primitive terminal to a really remote machine and you will prove this easily).

I should point out that this doesn't make the web much less interesting, nor does it mean that it will die anytime soon. But it does tell us that we can advance interaction design locally as well as on the web, and gives some considerable support to the app phenomenon.

I do think that apps and the web will fuse as this consideration is taken into account. Fast start, local storage, cached results, cached code for that matter, fast local graphics are all advantages that the desktop, phones and other devices have. 

Whatever development trends there are in hardware point this way too. Moore's law continues to outstrip network development. I have little idea why the network lags so far behind, but there it is. Even in the best case networks have a hard limit of the speed of light.

The time taken for a photon to travel from NY to SF is approx 20ms. Your server may be closer, but taking technology into account (like optical fibre and switches) it's not *that* much closer. http://www.wolframalpha.com/input/?i=san+francisco+to+new+york

Dear ol' Cray's line was that it only takes 1ns for an electron to travel a foot, for comparison. That's DVI, HDMI, USB, Bluetooth, CPU<->Memory and so on.
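The arithmetic behind these two figures can be checked with a short sketch. The city coordinates, the spherical-earth radius and the fibre refractive index of 1.47 are my assumptions; the results are one-way times over a straight great-circle path, before any switching or routing delay.

```python
import math

C_KM_PER_S = 299_792.458  # speed of light in a vacuum
FIBRE_INDEX = 1.47        # assumed refractive index of optical fibre
EARTH_RADIUS_KM = 6371.0  # mean radius, spherical-earth approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


NY = (40.7128, -74.0060)   # approximate city-centre coordinates
SF = (37.7749, -122.4194)

distance_km = great_circle_km(*NY, *SF)
vacuum_ms = distance_km / C_KM_PER_S * 1000  # one way, straight line
fibre_ms = vacuum_ms * FIBRE_INDEX           # light is slower in glass
feet_per_ns = C_KM_PER_S * 1000 * 1e-9 / 0.3048

print(f"distance: {distance_km:.0f} km")
print(f"vacuum:   {vacuum_ms:.1f} ms one way")
print(f"fibre:    {fibre_ms:.1f} ms one way")
print(f"light covers {feet_per_ns:.2f} ft in 1 ns")
```

Under these assumptions the vacuum figure comes out at roughly 14 ms and the fibre figure at roughly 20 ms, so the approx 20ms above reads best as the in-fibre number. A round trip doubles it, and the local foot of cable is some seven orders of magnitude quicker.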

What's the point? We are still waiting for a web that can never catch up. We should think in terms of internet + local clients (apps). The web renderer/browser is one of those, and a fair generic workhorse, but it's nothing special and a pretty lousy GUI client. What other client requires the intersection of 3 languages to make it work?

I mean to say that GUI development appears to be a Cinderella from the 80's, when in fact it is as relevant today as it was 15 years ago. Finding form, good programming paradigms, faster, better and surer ways of making great interfaces remains as live a problem today as it was in 1995.

Your user is right here, their data is over there. It's an interesting problem, but it's not the only one in town.

 

* This is clearly not true, you may well be able to afford a fractional amount of an Intel CPU (even > 1.0) to do some exciting work for you. Mass searches have to be cheap after all.

 

** The HTML text editor has been pushed to evolve to the point where it's a half decent text editor, but it's still a local GUI control. This is not a success of the web, it is the continuing success of the local GUI for interaction.

 

A couple of Apple predictions for 2011

In the spirit of a somewhat scientific approach to prediction, let me nail my colors to the mast and thus record:

1. Apple will introduce a new smaller form factor phone to address the middle of the mass market. This product will wrongfoot a lot of people who believe in progress through the (open) web on-the-go. It will also put paid to the idea that Apple will always be marginalized to the premium end of the market.

2. Later in the year they will lay the iMac down and make it multitouch. They'll introduce a lot of the iOS touch motifs to make this interface work, but the design will be confused because the keyboard is still in the way (of progress with desktop computing).

 

Hey Steve, the keyboard's in the way!

I was reminded of this again when I saw the Air unveiling. One can move it around, make it smaller. Make it disappear until we need it - pretty hard with physical objects. But mostly it's in the way.

OK, myths and other stones that haven't been turned over:

1. You need a keyboard to hold up your (laptop) screen.

No, one doesn't. The iPad cases demonstrate this.

2. You need a keyboard

Not a lot of the time. You need to be able to create words and text. Occasionally one needs to be able to originate a lot of text or do some editing.

0. We need a real keyboard because speech doesn't work.

True now, but that doesn't change that speech is primary (over typing) for most humans. That's certainly true in their houses, though not perhaps so much in the street, on the bus, at work or in their boudoirs. Also note that speech is mostly primary in the street, certainly in cars, and a lot of workplaces have people who do business on the phone all day long. This is a long, but inevitable project.

3. The keyboard is as small as it can be, and still be comfortable.

Even the small wireless keyboard has huge hulking batteries in it and is about 4mm thicker than it needs to be to keep the physical sensation of typing.

4. The keyboard has to go in front of the screen, just beneath it.

A smaller keyboard could go _on top_ of the screen if the screen was lowered and slanted to the right angle. We already know we don't need a real physical keyboard most of the time.

There's another opportunity here: if the screen knows that it has a keyboard on it and where, then the text can originate in the right place - just above the keyboard, like a typewriter or Jef Raskin's Cat.

Oh and we can use induction to keep it charged. We pretty much always know where it is, and we can find places to put it otherwise - locators on the base of the iMac or riff off of the magnetic holder for the remote and iSight.

 

Note to self

"What is the spiritual meaning of design?"

- Stewart Walker's paper at IDATER 99, "How the other half lives -
product design, sustainability and the human spirit".
http://www.lboro.ac.uk/departments/cd/research/idater/downloads99/walker99.pdf
- Stewart himself, http://www.ucalgary.ca/evds/walker
- Fraser Speirs on the iPad, http://speirs.org/blog/2010/1/29/future-shock.html
- And this Drucker quote, http://krmmalik.posterous.com/drucker-on-apple

Not necessarily related ;-)

Guessing at the Apple Tablet

Why bother, someone asked? Because I think we are witnessing an
important inflection in the development of computing. It's also fun to
measure ourselves against Apple's brilliance; how surprised are we by
the result, how good are our predictions? Here are mine:

I think the most important consequence of this device will be an extra
boost to the process of application redesign that has been ongoing
since the introduction of the iPhone multitouch screen.

Specifically I think it's the direct interaction (think FlightControl)
which is important about these devices rather than multitouch.

The hardware has been discussed a lot, so let's have a think about the
interface.

Beefier, faster device, so perhaps we'll see multitasking of a
handful of applications. That means there needs to be some visual
device for switching between apps. Or perhaps a button and an exposé
like affair.

Managing windows is a pain, so I expect all apps to be full screen.
Some areas like the switcher, the status indicator and in most apps
nav bars will be walled-off by convention as before.

There will be a new larger content area and this will have the most
long term impact. This screen will have enough space to be useful for
practical tasks as well as casual gaming and media consumption.

What apps will we see? I don't know, but I'm sure the community will
have a good go at everything in time as they have with the iPhone and
this device will make a wider gamut possible too. I think it unlikely
that Apple will go with an iWork like suite early on, but there may be
a few little surprises. Photos will be expanded to the point where it
could usher in a nascent iLife suite.

As I walked the length of a train the other day I saw what people were
doing with their laptops. A good amount of media, some document
review; wordprocessing and spreadsheets. Email of course. A few people
were using sophisticated creation apps like Logic, Flash or an IDE.

It's instructive to ponder how any of these applications might be
redesigned for the hour long journey on the train where one is mostly
reading, reviewing, marking-up and editing. Spreadsheets needn't
involve a great deal of text entry.

Once the door is open I expect a great deal of experimentation within
the envelope of the device; let's say 140 characters is about the
limit of text one can be bothered to type on it ;-)

API wise we can be sure we'll see the same Darwin/CocoaTouch layers as
before, simply a new shell.

Also the Application model will still be in place. There's no other
prevalent model for building software and Apple is going great guns
with the AppStore.

A few words about the hardware. We can assume the focus of the device
will be media, video, music with apps coming in second and books as an
interesting new avenue.

However I expect Apple to go with a fast bright screen suitable for
flicking and browsing rather than a long-life highly-reflective eInk
style screen. It might however adopt something like Pixel-Qi's hybrid
screen. These screens tend to be a bit dull (though practical) and
don't play so well at the point of pickup in a store, Apple's forte.
You wanted it before you walked in the store, picking it up just seals
the deal. So I expect the screen to major on 'gorgeousness' and less
on energy efficiency and reading practicality.

Concluding, this device will be much more interesting than we might
guess at. I expect it to usher in an era of application redesign to a
more intimate and direct experience than the desktop. This will also
be orthogonal to the local/gears/browser/cloud transition that is
going on.

From the perpetual future

http://hplusmagazine.com/articles/toys-tools/micro-machines-and-opto-electron...

O.K. So how does one make a virtual image at a distance from the
viewer with this? Right now, the LEDs sit on the surface of the
contact and would just appear as a bright blob. Microlenses within the
contact? Even today's smallest pixels would be huge. I'm trying to
think where I've seen this before...

Oh here we go, http://www.hitl.washington.edu/projects/common/papers.php?idx=1.
From the Human Interface Technology Lab at UW.

I wonder how far out the multispectral deformable dynamic
micro-lensing technology we need for this is - 50 years perhaps?

pg on Apple's mistake

Now this is interesting, and I'll riff off of it in more detail elsewhere, but I wonder whether Paul's rear-view mirror has a rosy tint: Apple has always had a difficult, off-on, love-you/ignore-you relationship with its developers. Back in the day when Develop! magazine would come in the airmail people were complaining about Apple's treatment of devs. Keeping them "close", but being super-secret all the same, mercurial one might say. Wonder where they get that from ;-)

Approaching the main point of the mobile platform I note or postulate

• It's a platform and platform is still important. The cabal of web browsers is one. Mobile is another. 

• For there to be change there either has to be a moral movement inside Apple (unlikely) or an adequate competitor. 

• Devs deserting Apple's working platform in any meaningful number is unlikely

• Palm or another getting a web-style, or open development process in place is also unlikely. 

• In a fair fight between Apple & Google over consumer electronics, Apple would prevail. 

• Google's offering will never be homogeneous enough to offer a great experience and hence be an adequate competitor.

Yes, devs may be hacked-off, but that's not enough to materially change the landscape. We've been here before, and analogous to the illusory era before the current financial calamity, the world has not changed that much.