
A Conversation with Jordan Cohen

Speaking out about speech technology

Jordan Cohen calls himself “sort of an engineer and sort of a linguist.” This diverse background has been the foundation for his long history working with speech technology, including almost 30 years with government agencies, with a little time out in the middle to work in IBM’s speech recognition group. Until recently he was the chief technology officer of VoiceSignal, a company that does voice-based user interfaces for mobile devices. VoiceSignal has a significant presence in the cellphone industry, with its software running on between 60 and 100 million cellphones. Cohen has just joined SRI International as a senior scientist. He will be working on government contracts as well as other ventures.

We recently got Cohen to pause long enough to share his thoughts on speech technology and its potential for home, automobile, doctors’ offices, and especially, cellphones. What will make the difference for its ultimate acceptance in the marketplace?

Cohen explores the answer to this question with another expert in the field, John Canny, professor of computer science at U.C. Berkeley. Canny began working with speech and signal processing while earning his undergraduate degree, then went into the areas of computer vision and robotics until the late 1990s. He has moved gradually into HCI (human-computer interaction), and lately has become interested in mobile devices. That has led him to initiate a couple of projects on speech, which his group has identified as one of the key technologies for HCI on mobile devices. His work also includes projects on context-awareness, computer-assisted education, collaborative work, and technology for developing regions.

Also participating in the discussion is Wendy Kellogg, whose background is in cognitive psychology. She has been working in the areas of HCI and computer-supported cooperative work for about 20 years at IBM’s Thomas J. Watson Research Lab. There she manages the Social Computing Group, which is working on novel approaches to representing people and their activities in online spaces to facilitate remote collaboration. Kellogg is also a member of the Queue editorial board.

JOHN CANNY Speech recognition, at least from some reports, seems to be doing well—certainly, VoiceSignal has been doing very well. Could you tell us a little bit about the rise of VoiceSignal?

JORDAN COHEN VoiceSignal took a very interesting tack when I joined about five years ago. We looked around and asked ourselves where was the market without asking what was the technology capable of doing. So the emerging cellphone market—don’t forget this was 2000—actually looked like a real market. It was a place where the interface was deficient in general, and it kept getting smaller and smaller, but people’s hands did not. And it looked like there would be about a billion new cellphones a year. So it turns out that a billion times any amount of money is actually money.

We had all of the pieces for making a real market, and then we set about building a technology that actually fit the market. The technology bent was to build speech recognition so it fit into the kinds of processors that were available in cellphones, not counting the DSP (digital signal processor).

The other important piece of making a business work is to find a sales team that will actually get down and dirty and go find customers. We decided to target large telecoms, and it turns out that meant targeting handset manufacturers, which was a long and tortuous path. We needed to find people who would dig up those customers, figure out how they worked, what their economics were, and who the movers were inside of those companies. That culminated in a pretty successful company, which is now looking forward to new applications on cellphones beyond voice dialing and transcription.

The VoiceSignal folks have had a discrete recognizer with a vocabulary of tens of thousands of words on the market for more than a year and just announced a continuous recognizer with a similarly large vocabulary in the handset. If you have environments in which the vocabulary is substantially larger, then what you really want is server-based recognition, and you may want to use a substantial amount of natural language, but natural language processing isn’t at the point where you can support it in these very small computer platforms.

So you want to have some handshaking between the local recognition or the local device and the remote server, and I think that’s a place that’s just coming to light where people are starting to pay serious attention. This is an area that has more problems than solutions, but it’s clearly something we have to handle sooner or later, to provide an interface with systems that are telecom-based today.
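
As an illustration only, the handshaking between a local recognizer and a remote server could be sketched roughly as follows. This is not VoiceSignal's design: the `Result` type, the `recognize` function, the confidence threshold, and both recognizer callables are hypothetical names introduced for this sketch. It assumes the on-device recognizer reports a confidence score and that the server is tried only when that score is low.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    text: str          # recognized words
    confidence: float  # assumed to range from 0.0 to 1.0
    source: str        # "local" or "server"


# Hypothetical recognizer signature: raw audio in, Result out.
Recognizer = Callable[[bytes], Result]


def recognize(audio: bytes,
              local: Recognizer,
              server: Optional[Recognizer],
              threshold: float = 0.8) -> Result:
    """Try the on-device recognizer first; fall back to the server
    only when local confidence is low and a server is reachable."""
    first = local(audio)
    if first.confidence >= threshold or server is None:
        return first
    try:
        return server(audio)
    except ConnectionError:
        # No network: keep the best local guess rather than fail.
        return first
```

The design choice here is that the handset stays usable without connectivity, and the network is consulted only for utterances the small-footprint recognizer is unsure about.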

Desktop dictation, except for the group of people who have trouble dealing with keyboards, is an interesting application but it’s not a market. The reason it’s not a market is because the existing keyboard, mouse, and screen interface is so good that being competitive with it is really a tough job. I don’t think anybody is going to do it for a long time.

CANNY I agree. Let’s turn to some other emerging markets, stepping a little bit outside the cellphone without going too far. For, say, large-vocabulary tasks, the home is an interesting market, as is the automobile. Another one is the medical market where doctors carry around small recording devices for physician order entry. Do you see a role in the future for, if not quite cellphone-embedded technology, then embedded voice recognition or dialogue-based systems for those markets?

COHEN There’s an interesting wild card in this whole schema, although not so much in the medical market, but certainly with home and automobile. That wild card is that cellphones are becoming much more connected and they’re becoming IP platforms. We’re seeing handsets now that support Wi-Fi and more complicated protocols.

An automobile has a very challenging acoustic environment for speech recognition. We’re starting to see manufacturers like Daimler putting more than one microphone in a car so they can do some array processing.

The product cycles are very long—five to seven years—so the product that finally hits the street has a five-year-old speech recognizer in it. That seems to be a problem. I think ultimately it gets in the way of making great advances in automotive telematics.

You need either to have a replaceable device—something that you can plug in—or to do the speech in your car through a cellphone that has a speech recognizer from last week. The interesting story here is Bluetooth. As more and more automobiles have Bluetooth, Bluetooth connectivity with the cellphone is essentially automatic.

The dark horse here is not the general telematics companies, but it’s the cellphone that is the standard mediator of speech recognition. There’s always the server-based stuff, but that’s just on the other end of the network. You can get to that network with your car radio or your cellphone radio or any way you like, given that the network is working. There’s always that connectivity question about the network. Connectivity is much better in Europe and Asia than it is in the United States.

In the home, there’s sort of the same story. No one, as far as I can tell, has quite gotten a handle on home automation. More and more devices are IP-aware. They have connectivity—typically broadband or telephone connectivity—but again, your cellphone is IP-aware in the same way. So I think that’s the dark horse here. You may start to see speech in remote devices.

There’s always a problem with acoustics in the home, so you want to talk close to the thing. If you have a remote that’s doing speech, you can’t leave it on because the battery will run down, and you don’t want it plugged in because that’s a pain in the butt.

I think that’s the secret problem in these speech-enabled things. Either you need to do very sophisticated room acoustics to solve speech recognition problems or you end up using the cellphone in your pocket, and all of those prior management problems have been solved.

CANNY People routinely walk around now with headsets, which provide a pretty good location for speech recognition, and they are comfortable doing that either with Bluetooth headsets or with corded headsets into the cellphone, so it could easily become a general interface device. I would agree with you, at least in the near term, that cellphones are a very good mediator in that space.

COHEN We solve lots of technical problems that don’t have anything to do with speech technology. They have to do with getting mobile devices to work reliably and for the batteries to last a long time and for placement of the microphones to be in a sensible location so you get good audio. Cellphones have good audio systems, and they’re cheap because the economies of scale really work.

CANNY There was a boom in speech technology globally around 2000. I think the market roughly doubled in a very short time and then contracted back down. Hopefully we’re going to see some growth again, but what should people worry about in terms of avoiding another bust cycle?

COHEN That bust cycle was pretty interesting. Part of the problem was that people focused on markets that weren’t markets—namely, desktop dictation.

IBM also made an aggressive move in the late 1990s to undercut the prices of all the other folks in the marketplace, and it did a pretty good job at making it impossible to make any money in that market—certainly, the dictation market.

It has been a pretty tough environment in which to have control, and I’m not sure we’re going to see the same thing again, but there needs to be a little bit of tolerance in that marketplace because it’s always going to be vulnerable.

The barriers to entry in embedded systems are pretty substantial, but the barriers to entry in consumer-based devices are not. So you do have to worry about that.

CANNY As you said at the beginning of the discussion, the trick is to figure out what the market is first before you dive in with the technology.

COHEN Yes, absolutely.

CANNY Remaining in the context of cellphones, what new services or improvements to existing services do you think speech can provide?

COHEN It’s a pretty good interface compared with the buttons and the very small keyboards that you have on these devices. It’s a good general interface. The things that are going to make it exciting are the new services that are coming out, among them location-based services for either push or pull kinds of information. We’re seeing social networking services, which are a great hit among the younger generation.

The killer application is probably going to end up being some kind of interface with search, which seems to be the very hot topic in the world today; for mobile search especially, speech is a pretty reasonable interface, at least for the input side of it. The output side poses a whole different question.

But dealing with search means you need to support a very large vocabulary so people can ask anything they want to. You really need to have a feel for how to do that, but there’s a hidden interesting and complicating factor: language. What we have now is a world with about 6,000 different languages, but 98 percent of the population speaks only 40 or 50 of those. Having said that, however, I must say that anybody who wants to be really successful has to cover that language base, and that’s a tremendous undertaking.

CANNY Could you talk about where you see speech technology going? For example, VoiceSignal had a vision and has been articulating more natural kinds of social interfaces. What do you think will be available in the next few years?

COHEN It’s a little bit hard to predict, but there are two things happening. One is that companies like VoiceSignal are trying to pay lots of attention to the user experience, and that turns out to be a key driver. We have to make things that are natural for people to use. It’s essentially a human factors issue.

The other thing that’s happening is that the amounts of available computing and memory are exploding. The new ARM11 processors are quite substantial—essentially the equivalent of your PC—and we’re starting to see very small, very high-capacity memory, so it’s easy to envision a 10-gigabyte phone.

So that’s not going to be the technology impediment. The technology impediment is going to be how smart we are at implementing things.

One question is, what’s going to drive that technology to make the breakthroughs that are sensible? I’m involved in the GALE (Global Autonomous Language Exploitation) program for DARPA, which is about driving both translation and what they call distillation, making sense out of data in a multilingual environment. As soon as you have a multilingual environment, you are forced to deal with these issues—about what they mean and how they relate.

It may be that we won’t have success until that technology matures, and I think there are some real drivers from the government side to push money into that space.

CANNY For developers wanting to use speech, either on mobile platforms or otherwise, what tools or platforms should they be looking at?

COHEN There’s a whole series of platforms. A lot of open computing is mobile, such as PDAs. Smartphones are an increasing part of the market, and those smartphones have open operating systems. There are two basic developer platforms that are supported in the United States. For CDMA phones, there is BREW (Binary Runtime Environment for Wireless). BREW is a developer-friendly platform where people can do development and then Qualcomm will be their bank and actually make it available. The other platform is Java; almost every cellphone platform today supports Java in one way or another.

It’s not entirely clear that you can write speech recognition engines in either of these platforms. I guess it’s remotely possible. I’m not sure there’s enough compute power to do it, but if there’s a speech recognition engine, you can certainly look for APIs that are interfaced to both BREW and Java that allow you access to those things. I think that’s where the action is going to be.

CANNY What about commercial products from the developer’s point of view? Do you think there will be opportunities there?

COHEN The opportunities are going to be in services. There is an active developer community. Certainly Nokia is pushing its Series 60 and Series 80 devices, which are open, and Microsoft is increasingly making inroads, providing Windows access on PDAs, cellphones, and smartphones. Symbian also has an open, slightly different platform. So there are lots and lots of opportunities to develop things in this space.

The dark horse here may be the gaming industry because we’re starting to see serious efforts to make speech an essential part of some games. That design problem, the design of the game, is pretty interesting, but we’re starting to see some signs of solutions, and that may actually drive the technology.

CANNY What are some of the challenges for developers who want to implement speech on phones? What should they be worrying about?

COHEN Usability is really the key issue. In many kinds of applications there are multiple input modes. We certainly see that on the phone. The trick for a developer is to find a place where the speech is of some finite value to the customers, where the value is more than they get from the application without the speech.

It’s easy to overlook that. You can fool yourself into believing you have something spectacular, but you really need to talk to customers and do tests and make sure that what you’re doing really has added value for them. If you don’t do that, then you don’t have a business.

WENDY KELLOGG I’ve been a long-standing skeptic of speech recognition, probably as a result of having grown up with “Star Trek,” where it works perfectly. But is it not the case that the state of the art for continuous speech recognition is still very far from that fluid ideal that people often have in mind?

COHEN That’s true, and I must say, companies such as VoiceSignal haven’t fixed any of that. They haven’t been pushing the basic technology; they have just been doing really smart engineering.

KELLOGG On the other hand, specific services that drive people to use the technology will help to drive better and better recognition, and certainly the capability of putting the technology in cellphones.

I think the multilingual uses of speech recognition—which sounds completely impossible, we can’t even do one language—are certainly very enticing. Even simple translation on the Web is very useful.

COHEN They are enticing, and there’s a real market there. There is even a commercial market for people traveling.

The standard speech-to-speech translation systems are phrase-based, so you either have to know or guess a phrase in one language and the system will translate it into the other language. That’s sort of a hack, but it actually gives you some capability. The modern systems actually try to do translation, and they’re modestly successful with a whole bunch of compute. We won’t see those for a while.

KELLOGG Would the translation be server-based for the foreseeable future?

COHEN If you wanted to do it today, the answer would be yes. If a company such as VoiceSignal were to take a whack at that, it could probably do a reasonable job at mimicking the state of the art. Having said that, however, I think the state of the art is not very good, so there is definitely a problem there.

KELLOGG How cool would it be to have a Babelfish application, drawing on the Hitchhiker’s Guide to the Galaxy? That was an absolute necessity for Douglas Adams to write his story, but wouldn’t it be great to wear your cellphone headset and walk around a country where you don’t speak the language and be able to understand something about what’s being said around you?

COHEN The first person to do that is going to make a lot of money.

KELLOGG What are the cooler speech recognition games that are out there right now?

COHEN I’ve seen some examples, but I’m not in love with any of them. The company that has made the most noise about this is Fonix, which has a recognition toolkit available with the Xbox. There have been some Xbox developers developing games where the speech that you are allowed to say is limited to three or four or five phrases, and it’s on the screen so you know exactly what the possibilities are. They tell me that the gaming experience is terrific.
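
To make the constrained-phrase idea concrete, here is a minimal sketch of mapping whatever a recognizer returns onto a short list of allowed commands. It does not use the Fonix toolkit or any Xbox API; `match_command`, the example phrases, and the similarity cutoff are all assumptions made for illustration, and the fuzzy matching relies on Python's standard `difflib` module.

```python
import difflib
from typing import Optional, Sequence


def match_command(hypothesis: str,
                  allowed: Sequence[str],
                  cutoff: float = 0.6) -> Optional[str]:
    """Map a recognizer hypothesis onto one of a few allowed phrases.

    Returns the closest allowed phrase (lowercased), or None when
    nothing is similar enough to count as a command.
    """
    normalized = hypothesis.lower().strip()
    lowered = [phrase.lower() for phrase in allowed]
    # get_close_matches ranks candidates by similarity ratio.
    matches = difflib.get_close_matches(normalized, lowered, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical on-screen commands for a game.
COMMANDS = ["open fire", "take cover", "fall back"]
print(match_command("take covers", COMMANDS))
print(match_command("what time is it", COMMANDS))
```

With only a handful of phrases to choose from, even a slightly wrong hypothesis usually lands on the intended command, which is one plausible reason such tightly limited games can feel reliable.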

CANNY One reason I’m getting involved in speech is because I’ve run into a lot of people telling me they need speech, specifically in the medical community. From their perspective, medical practitioners don’t really take desktop or even laptop or mobile computing very seriously. They use voice for a lot of routine communication, note-taking, etc. They are already adopting speech technology pretty fast, and certainly they would adopt it faster if it were smaller and more convenient.

Like Wendy, I’ve been skeptical, but at the same time, a lot of people now have the newer phones with voice dial or voice lookup, and they use it, and they like it. They don’t know the history of speech processing.

COHEN There’s an interesting quandary in the medical use of speech having to do with doctors. You can develop a speech application that doctors are direct users of. Unfortunately, that also makes them the editors. They are in the position of having to say whether the result that came out of the speech input is right or not.

There are some companies now doing speech recognition behind the doctor’s back—for example, supporting transcription services. That looks like a terrific business.

Here’s another application of that technology: a company called HearingRoom provided approximate transcripts for hearings in Washington, D.C. Lots of hearings happen all the time, and they are impossible to get information about because the transcripts are published several months later.

HearingRoom arranged to have all of the hearing rooms fitted with microphones and employed a group of “retranscriptionists,” who would re-speak what they heard into a speech recognition device, producing transcripts that were about 95 percent accurate with turnarounds of about 12 minutes for these sessions. Its market was the legal community in Washington, D.C.

CANNY What general suggestions would you give to people interested in or thinking about speech technology? I think a lot of people in the community are skeptical at this point, or probably quite curious because obviously they’ve seen the growth in the embedded market, the phone market. What advice would you give them in terms of interface speech development?

COHEN The crucial issue is finding an application where speech adds value and has a real market—that is, people are willing to spend money for it. That’s going to drive absolutely everything, and if you don’t have it, then you’re just wasting your time. Speech gets better and better, so I think there’s going to be a bright future. We’re starting to see multimodal interfaces, and I must say part of VoiceSignal’s capability is to do speech synthesis in the handset as well. It’s more than just speech recognition. It’s really dealing with the interface in a sensible way—the text and the speech and the output—and getting it all to work in a fashion where the user is least bothered and where it is most helpful.

CANNY That raises an interesting question. An article in Speech Technology magazine compared speech interface design to visual interface design in the sense that whatever you do creates a user experience, an impression, rather like the visual design of a Web site.

There’s no neutral interface. There’s no neutral Web site. If you use simple Roman font ASCII text, you create a certain impression of the site—probably very bare bones, minimalist. Speech is the same. If you have a monotone computer-synthesized voice that you interact with, it’s not neutral. It creates a user experience aside from the content that goes into the dialogue design. Speech design, the design of the whole experience, seems to be quite a complicated process.

COHEN In the DARPA project a decade ago, people discovered that nontechnical people’s impressions of a system—an airline reservation system, for example—depended tremendously upon the quality of the synthesis that was used and had very little to do with the quality of the speech recognition.

The quality of speech synthesis is going to be the equivalent of the quality of the speech recognition, and that’s going to help a lot in getting people to accept these applications.

KELLOGG You talked about the usefulness of transcripts because speech has inconvenient characteristics. It’s not easy to scan quickly. You can compress it when you play it back, but it is hard. You can do things with text that you obviously can’t do with any kind of speech. So there’s the input side where people try to digest speech.

But there’s also the output side. We’ve been building a system in our lab that uses an IVR (interactive voice response) system, which constructs things from synthesized speech, recorded speech, and bits of information. That’s how you can make speech more computable, if you will, or programmable. I was wondering if you have any thoughts about that, or where things are going in the future? It seems like people are going to want to hack up speech as it becomes more available in various ways. Are people thinking about this or building tools for making it easier to mash up speech?

COHEN We see a bunch of toolkits like the ones that come out of Microsoft for using speech, and my sense is that they’re tremendously complicated and they give you access to absolutely everything—which you don’t want. I think what you’re going to see in the future are simple toolkits that allow you just the amount of access you need as a developer.

There’s a quality issue with speech synthesis and recorded speech. One of the hidden pieces of that quality is that as the environment gets noisy, the amount of cognitive load that you need to decrypt the speech synthesis goes up tremendously, and for bad-quality speech synthesis it explodes. You need to fix the quality before you have usable interfaces in any kind of noisy environment.

CANNY Do you feel that VoiceXML is a reasonable intermediate standard for writing speech interfaces?

COHEN It’s a way to get your arms around the load. It seems to work OK in the IVR space. I’m not sure that I would want to write lots of stuff in VoiceXML. It assumes a lot of things about what the system is doing behind you. It assumes you’re on an IVR system. It assumes that you’re talking over the telephone and that’s the only interface so you don’t have other kinds of information.

It’s a pretty limited interface. It has served the community pretty well because it did provide a standard, but I don’t think it’s going to last.

This is going to be an interesting time. I don’t think there is going to be any more government support for speech as an engineering technology. There’s going to be support for applications. I’m looking for somebody to figure out how to make speech into a scientific endeavor, and I think when we do that, we’ll actually learn a lot more about speech and language. We’re going to be an application space for quite a long time.

KELLOGG That can be very inspiring, certainly fun for consumers and users.

COHEN Oh yes. Even imperfect technologies allow you to make money if you do the right thing.


Originally published in Queue vol. 4, no. 6

© ACM, Inc. All Rights Reserved.