The story of speech recognition so far
Siri is a breakthrough in speech recognition and control, but we've been talking to computers since the 1950s, with varying degrees of success – and machines have been talking back to us since the 1930s. The story of how we've made computers (and phones and cars, and soon home appliances) listen and talk is as fascinating as the results, with scandal and fraud as well as breakthroughs; military history and specialized – and secret – tools for the intelligence services are as much a part of it as university and commercial labs. Hollywood showed futuristic talking computers decades before they were possible, but the music and film industries have also pushed developments in artificial speech and sound. Here's how we got to Siri – and beyond.
Ancient history: talking heads and speaking machines
Legends of medieval scholars with heads of bronze that not only spoke but told the future range from Roger Bacon to Faust; Bacon's mechanism appears in an early science fiction story as The Brazen Android. Christian Kratzenstein built the first real speaking machine in 1773, using tubes and organ pipes to make artificial vocal cords. It only produced vowels, but in 1791 Wolfgang von Kempelen (maker of the chess-playing Mechanical Turk – with a human player inside) created one with two resonating tubes; manipulating one in each hand, he made it 'speak' whole words. And in 1845 Joseph Faber toured his Euphonia speaking machine to audiences including the father of Alexander Graham Bell.
The 1930s bring the first artificial voice
The first electronic speech synthesizer, the VODER (Voice Operating DEmonstratoR) built by Homer Dudley at Bell Labs, was difficult to operate and wasn't that easy to understand when it said "Good evening, radio audience" at the 1939 New York World's Fair – but it was good enough for the New York Times to declare "My God, it talks." In 1936 the UK telephone service introduced the Speaking Clock, with recorded words and phrases joined into sentences. Like the 1950 Pattern Playback machine from Haskins Labs, it used images of speech, stored as spectrograms, read by light and played back. SOUND LINKS The VODER speaks; The Pattern Playback speaks
The Vocoder: a sidestep into music and film
Dudley also developed the Vocoder, a speech compression system originally designed to save bandwidth on telephone networks (think of it as a physical equivalent of the Skype codec: it compresses speech by running it through a series of filters, so you can send a much smaller signal and play it back through another set of filters at the far end). It was never used in telephones but became popular with musicians like Wendy Carlos, who used it on the soundtrack of A Clockwork Orange. Many effects you think are a Vocoder – like the talking train in Dumbo or Sparky's Magic Piano – actually use the Sonovox. This has two discs you press against your throat; mouthing the words makes the discs vibrate, producing a similar effect.
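The analyze-and-resynthesize idea behind the Vocoder is simple enough to sketch in code. Here's a toy frame-based version in Python – it uses FFT bands rather than the analog filter banks Dudley used, and made-up signals stand in for real speech and a carrier:

```python
import numpy as np

def channel_vocoder(speech, carrier, n_bands=16):
    """Toy channel vocoder: measure the energy of the speech in each
    frequency band, then impose those band envelopes on a carrier.
    (A real vocoder tracks envelopes continuously with bandpass
    filters; this frame-based FFT version just shows the idea.)"""
    frame = 512
    out = np.zeros(len(speech))
    edges = np.linspace(0, frame // 2, n_bands + 1, dtype=int)
    for start in range(0, len(speech) - frame, frame):
        S = np.fft.rfft(speech[start:start + frame])
        C = np.fft.rfft(carrier[start:start + frame])
        for b in range(n_bands):
            lo, hi = edges[b], edges[b + 1]
            env = np.abs(S[lo:hi]).mean()          # band energy of the speech
            mag = np.abs(C[lo:hi]).mean() + 1e-12
            C[lo:hi] *= env / mag                  # impose it on the carrier
        out[start:start + frame] = np.fft.irfft(C, frame)
    return out

# A modulated sine stands in for speech; a buzzy sawtooth is the carrier
rate = 8000
t = np.arange(rate) / rate
speech = np.sin(2 * np.pi * 300 * t) * (1 + np.sin(2 * np.pi * 3 * t))
carrier = ((t * 110) % 1.0) * 2 - 1  # sawtooth at 110 Hz
result = channel_vocoder(speech, carrier)
```

The musical trick is exactly this: replace the far-end filters' input with an instrument instead of the original voice, and the carrier "speaks".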
The 1950s and the beginning of recognition
Bell Labs was working on recognizing speech as well as just transmitting it, and in 1952 researchers created Audrey, the Automatic Digit Recognizer, which could recognize individual spoken numbers with up to 99% accuracy – as long as it was a man speaking, with a distinct pause after each number, using no words other than one through nine (and 'oh' for zero). Audrey had to be tuned for each speaker, who had to record enough samples for the system to store twenty typical patterns for each number in its analog memory. What Audrey listened for were the 'formants' – the two peaks in the frequency spectrum of speech that are enough for us to tell vowels apart.
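A rough software version of what Audrey's analog circuits did – pick out the strongest peaks in the spectrum of a frame of speech – might look like this. It's only a sketch (Audrey was all analog electronics), and the test signal is two sine waves standing in for the formants of a vowel:

```python
import numpy as np

def formant_peaks(frame, rate, n_peaks=2):
    """Crude formant estimate: smooth the magnitude spectrum of one
    speech frame and return the frequencies of its strongest peaks."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    smooth = np.convolve(spectrum, np.ones(5) / 5, mode="same")
    # local maxima of the smoothed spectrum
    peaks = [i for i in range(1, len(smooth) - 1)
             if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]]
    peaks.sort(key=lambda i: smooth[i], reverse=True)
    return sorted(i * rate / len(frame) for i in peaks[:n_peaks])

# Two sine components standing in for a vowel's formants
rate = 8000
t = np.arange(1024) / rate
frame = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
f1, f2 = formant_peaks(frame, rate)
```

Vowels differ mainly in where those first two peaks sit, which is why two numbers per sound were enough for Audrey to tell digits apart.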
IBM’s Shoebox: speech and sums in 1962
By the 1962 World’s Fair, IBM was showing off a smaller and more powerful system called Shoebox (because of the size and shape of the wooden case) that worked as a voice-controlled calculator. Shoebox recognized all ten numbers plus six commands including plus, minus and total so you could speak math problems into the microphone and get the results printed out on the manual calculator it was linked to. Each number recognized was displayed in lights so you could check it was getting the numbers right every time.
1961: the singing computer that inspired HAL 9000
Speech in science fiction switched from repetitive monotones ("Warning, Will Robinson!") to natural, realistic voices: the computer in Star Trek (voiced by Majel Barrett), C-3PO in Star Wars – and of course, HAL 9000 in 2001. In the early 1960s Arthur C Clarke was visiting Bell Labs when John Kelly and Carol Lochbaum programmed an IBM 704 – the first mass-produced computer with floating point arithmetic and core memory rather than vacuum tubes – to sing Daisy Bell (better known as A Bicycle Built For Two). It was later released on the album Music from Mathematics. Impressed, Clarke added it to his screenplay, and Stanley Kubrick used an Eltro Mark II audio processor to change the pitch and speed of actor Douglas Rain's voice as he sang Daisy, Daisy. SOUND LINK IBM sings Daisy Bell
The 1970s: recognition but not real-time
Clarke was at Bell Labs to visit executive director John Pierce, who decided in 1969 that general purpose speech recognition was as unlikely as "curing cancer or going to the moon" and stopped working on it. DARPA funded the Speech Understanding Research program in 1971. The most successful project was Carnegie Mellon's Harpy, which was 95% accurate recognizing continuous speech with a vocabulary of 1,011 words. However, it needed training, it took 80 times longer to recognize a sentence than it took you to speak it, and it only worked with words in a specific order. IBM's voice-activated typewriter – connected to an IBM 370 computer – also had a thousand-word vocabulary, but it took an hour to process a sentence. Still, the 1970s saw the first commercial speech recognition company; Threshold sold its VIP-100 system to FedEx for sorting packages on a conveyor belt.
1978: the Speak & Spell brings voice synthesis to toys and games
Talking toys showed up in 1960 with the Chatty Cathy doll, but the pull-string wound up a tiny record player inside the doll. Texas Instruments' Speak & Spell, launched at the 1978 CES, was the first commercial use of digital sound processing and featured the first speech synthesis done on a single silicon chip. It began as a $25,000 research project to find something that would show off TI's bubble memory research (speech data needed a lot of storage), but it actually used two 128K ROMs. The first speaking chess computer followed in 1979; in 1980 Milton was the first multi-player electronic game with voice synthesis – used to insult the players. 'Talking' arcade games Stratovox and Berzerk came out the same year. SOUND LINK The Speak & Spell
Early talking personal computers
From 1978 TI also offered a speech synthesizer peripheral for its TI-99/4 and 4a home computers, often bundled free with video game cartridges that used speech, like Alpiner and Parsec. The plan was to sell extra cartridges to expand the small built-in vocabulary, but the software text-to-speech in the Terminal Emulator II cartridge turned out to be good enough. UNIX had text-to-speech in 1972 if you had the right hardware, but in 1983 Atari used a specialized chip for text-to-speech synthesis in the 1400XL home computer. In 1984 the Mac launched with the software-only MacinTalk speech synthesis, and in 1985 the Amiga included speech synthesis developed by the same software company, SoftVoice (which originally called it Software Automatic Mouth).
1980s: the birth of today's speech recognition companies
In the 1970s, speech recognition worked by brute force, trying to match each word individually, one at a time; in the 1980s most researchers adopted a mathematical technique developed at Princeton in the 1960s, called Hidden Markov Modeling, which worked out the probability that a sound was a specific word. 1982 saw the launch of three major companies: Covox, Dragon Systems and Kurzweil. By the 1990s they all had voice recognition software that ran on a PC rather than a mainframe, as did IBM, but it could still only handle a few words at a time. Kurzweil could recognize 1,000 words in 1985 and 20,000 words in 1987, but it was 1995 before it was right more than half the time. And the software was pricey; in 1990 DragonDictate cost $9,000.
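The core of the HMM approach is the 'forward algorithm': sum, over every path through a word's phoneme states, the probability that the model produced the sounds you heard, then pick the word whose model scores highest. A toy version, with made-up probabilities and discrete acoustic labels standing in for the continuous features real recognizers use:

```python
import numpy as np

def forward_prob(obs, start, trans, emit):
    """Forward algorithm for a hidden Markov model: total probability
    that this word model generated the observed acoustic symbols."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]  # propagate, then weight by emission
    return alpha.sum()

# Toy two-phoneme word model over three acoustic labels (invented numbers)
start = np.array([1.0, 0.0])            # always begin in phoneme 0
trans = np.array([[0.6, 0.4],           # stay in a phoneme or move on
                  [0.0, 1.0]])
emit = np.array([[0.7, 0.2, 0.1],       # phoneme 0 mostly emits label 0
                 [0.1, 0.2, 0.7]])      # phoneme 1 mostly emits label 2

likely = forward_prob([0, 0, 2, 2], start, trans, emit)    # sounds in the right order
unlikely = forward_prob([2, 2, 0, 0], start, trans, emit)  # phonemes reversed
```

The sound sequence matching the word's phoneme order scores far higher than the reversed one – that probability-based ranking, rather than exact template matching, is what made 1980s recognizers practical.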
1988 Apple envisions speech technology
Siri was right on time. In 1988, Apple made a video showing how, by 2011, you might be able to talk to your computer and have it understand you well enough to carry on a conversation, answer phone calls, summarize voicemail and make reservations for you. Apple called the concept the Knowledge Navigator, and it's far more than speech recognition or synthesis; it's closer to the idea of an artificial intelligence or computerized assistant, complete with a bow-tie-wearing avatar. And the device is a folding tablet with a touchscreen that you can tap to interrupt the assistant. VIDEO
1990s: automating call centers
By the 1990s voice recognition running on something more powerful than a PC was reliable and accurate enough to use for automating customer service calls. In 1992 AT&T introduced the snappily-named Voice Recognition Call Processing Service, followed by the friendlier How May I Help You system; this included voice dialing and recognizing keywords to route phone calls. In 1996 Nuance built the Voice Broker service for Charles Schwab; it could answer 360 callers wanting quotes on stocks and options at once. Voice Broker was accurate enough that Sears, E*TRADE and UPS soon automated their call centers as well.
Wildfire: the first Siri – in 1994
Apple wasn't the only one thinking about a voice assistant back then. In the 1990s, Rich Miner – later one of the co-founders of Android – built a service called Wildfire that launched in 1994 and was eventually sold to Orange. You called the Wildfire virtual assistant from any phone and gave it short voice commands, to which you got friendly answers in a voice that sounded like a female assistant. You could ask for messages and reminders, or say who you wanted to call, adding 'at home' or 'at work'. Wildfire answered the phone for you, asked for the name of the caller and told you who it was before you picked up. The business users who paid for it tended to be big fans, but there weren't enough of them to keep it running.
Fraud and consolidation: how we got today's voice recognition players
In 1997 Kurzweil was sold to a Belgian software company, Lernout & Hauspie (which was working with Microsoft and Dictaphone), and in 1999 Microsoft bought Entropic, which claimed to have "the most accurate speech recognition system" of the time. In 2000 L&H bought Dragon Systems, but the company was struggling financially and in 2001 the founders and CEO were arrested for accounting fraud (in 2010 they were finally convicted and sentenced). ScanSoft – originally another Kurzweil company – bought the L&H speech recognition technology (which went into Office 2003), snapped up some other speech companies, took over IBM's ViaVoice in 2003 – and renamed itself Nuance in 2005 when it bought a company of that name that came out of SRI's Speech Technology and Research lab. (Nuance's technology is used for Siri.)
Free in Windows: voice recognition in Windows XP
By 2001, voice recognition had gone from software that wasn't that accurate, cost thousands of dollars and was aimed at professional users to something good enough (80% accurate) and cheap enough to build into Windows. Windows Speech Recognition in XP needed training in the room you planned to use it in, but you could use it to control software as well as dictate documents. It wasn't in all versions of Windows XP though, just the Tablet PC edition – the thinking was that without a keyboard, you were going to need speech recognition more. (It's been in all editions since Vista.) Since then, the main improvements in PC speech recognition have been reducing the amount of training required and gradually improving accuracy.
State of the art: lawyers and doctors
The heaviest users of dictation software have always been lawyers and doctors; lawyers, presumably because they're paid by the hour, and doctors because they have so many sets of patient notes to write up. Dragon and IBM both had specialist products; IBM MedSpeak was the basis of what became ViaVoice for general users. Law and medicine are full of terms you don't use in everyday speech, and what lawyers and doctors dictate concentrates on specific topics. That gives the software a smaller vocabulary to recognize, which makes the dedicated versions very accurate. There are even automated dictation services like BigHand that work with recordings made on a BlackBerry.
Talk to your phone
Voice dialing is nothing new; feature phones had it back in the early 2000s. At first you had to speak the individual digits of the phone number, but later models could recognize names – although typically up to ten names rather than everyone in your address book, and you had to record each name you wanted recognized up to three times. In 2005 Samsung added voice dictation for text messages as well as voice dialing in the $99 SCH-P207 handset, and by 2007 even the cheapest Nokia handsets had voice dialing.
Talk to your jet fighter
If you're a fighter jet pilot hurtling along at Mach 1.8 under 6Gs of acceleration, having to flip buttons, press switches and glance down at instruments in the cockpit could take your attention away from something vital. Since the late 90s, the Department of Defense has been trying out voice recognition instead; experimental voice control systems have been tested in a range of fighter jets including the F-16 and the Harrier AV-8B. Early systems had around 25 controls, starting with tasks that aren't critical to staying in the air, like selecting radio frequencies. The most advanced system so far is due to go into the F-35 Lightning II to control communication and navigation, using a microphone in the pilot's oxygen mask and a display inside their helmet; it's being built by SRI International – the research lab both Siri and Nuance came from.
The Phraselator: Military translation from 1999 onward
Once a computer can hear you and speak to you, why can’t it translate for you? In 1999 VoxTec was making laptop-sized translators for DARPA but by 2003 they had them down to the size of a paperback book with up to 3,500 phrases. The latest Phraselator uses the same DynaSpeak voice recognition from SRI that’s going into the F-35 voice control, although instead of using speech synthesis it plays back pre-recorded MP3 files (and stores around 12,000 phrases in multiple languages). The next goal is two-way translation.
The Web learns to talk: 2000 onwards
As the Web became popular, companies found they were building the same information and support systems twice: once for the voice recognition software in their call center and once for their Web site. In 1999 work started on VoiceXML – an XML-based relative of HTML for building an automated voice service the same way you write a Web page. These days, many of the call services that let you track packages, order wakeup calls and get directory assistance are actually Web pages that listen and speak.
Google and Microsoft make 411 free
Most 411 directory assistance calls are now powered by voice recognition and speech synthesis; in 2007 AT&T and Verizon were charging $2 or $3 a call when both Google and Microsoft launched free 411 services. Google discontinued its free service in 2010, once it had collected enough voice samples for its machine learning system to create the voice recognition it uses in Android. Microsoft already had that from buying TellMe, as well as better voice synthesis, and Bing 411 (based on the TellMe service) is still free (1-800-BING-411). You can get traffic reports, driving directions and weather reports as well as business numbers. SOUND LINK Hear Bing 411 in action
Mobile search starts with voice
Searching from your phone with Siri, Bing or Google isn't actually new. In 2008 Yahoo and Microsoft both brought out smartphone search tools with voice search built in. Yahoo's oneSearch included both text and voice search and gave you useful information and links to Web pages. Microsoft's TellMe app was voice search only. It used GPS to get your location so results were always local, and you could start a search with the green dial button, making it simple enough to use in the car – where 95% of mobile 411 calls were being made that year.
Voicemail you can read: speech recognition
When Apple launched Visual Voicemail with the iPhone in 2007, it was a great way to see your messages all in one place but it wasn’t recognizing speech. SpinVox (another company that’s been acquired by Nuance) was doing voicemail speech recognition in 2003, for messages left on your cellphone, company phone or even Skype. They also had tools for texting and writing blog posts from your phone. Jott had a similar voicemail to text service in 2006; Nuance brought out Voicemail to Text in 2008 then bought Jott as well and now only offers voicemail recognition services for AT&T and Vonage. In 2010, Microsoft added voicemail to text to its Exchange email server, so you can get voicemail transcribed into Outlook.
What phones that don't have Siri can do today
The reason Siri is so engaging is a combination of the voice recognition that Nuance does (in the cloud, with servers far more powerful than the iPhone's ARM processor), the jokes built into the system and the DARPA-funded artificial intelligence project CALO (Cognitive Assistant that Learns and Organizes), which helps it understand ideas as well as words. Other phones have the same quality of voice control for dictating messages: Nuance sells Dragon Dictation for iPhones and Dragon for Email for BlackBerry, and it's also what Ziggy for Windows Phone uses. Windows Phone lets you search Bing or dictate and send text messages, and Android has both Voice Search and Voice Actions.
Talk to Kinect: it's not just for jumping
When Microsoft says "you are the controller" it's not just about waving your hands or playing air guitar. As well as the infrared sensor that tracks your movement, Kinect has a microphone array with echo cancellation; it subtracts the sounds of the game or movie you're playing, and any mechanical noise like the Xbox fan, to make the voice recognition work better. The multiple microphones also let it track who's speaking if you're playing with friends. Voice commands open apps, search Bing and control the DVD drive. Not every Xbox app works with voice recognition, but Kinect Sports: Season 2 has 300 extra voice commands.
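Subtracting a known playback signal from what the microphone hears is classic adaptive filtering. A minimal least-mean-squares sketch – not Kinect's actual pipeline, and with synthetic signals standing in for real audio – shows the idea: since the console knows exactly what audio it's playing, it can learn how that audio arrives at the microphone and remove it, leaving the speech.

```python
import numpy as np

def lms_echo_cancel(mic, playback, taps=32, mu=0.01):
    """Least-mean-squares adaptive filter: estimate how the known
    playback signal (the game or movie audio) arrives at the
    microphone and subtract it, leaving mostly the speech."""
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = playback[n - taps:n][::-1]      # recent playback samples
        echo_est = w @ x                    # predicted echo at the mic
        e = mic[n] - echo_est               # what's left: the speech
        w += mu * e * x                     # adapt the filter estimate
        out[n] = e
    return out

# The mic hears speech plus a delayed, attenuated copy of the playback
rng = np.random.default_rng(0)
playback = rng.standard_normal(5000)
speech = 0.1 * np.sin(2 * np.pi * 0.01 * np.arange(5000))
mic = speech + 0.5 * np.roll(playback, 5)
cleaned = lms_echo_cancel(mic, playback)
```

The same principle runs in speakerphones and VoIP clients; Kinect adds a microphone array on top of it to locate speakers in the room.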
Computers learning to decipher our emotions
These days voice recognition in call centers doesn't just try to decipher your words. Many systems also detect your emotions; if you sound angry, upset or frustrated, you're more likely to get transferred to a real person quickly. Synthesized speech, though, rarely expresses much emotion, although the best text-to-speech systems – used for audio books and software readers for the blind – do change the timing and pitch of the words. The problem is that the small chunks of speech recorded by voice actors, which the software pieces back together, tend to be deliberately neutral. Several projects are trying to make speech synthesis more expressive; VivoText and emoSyn both synthesize words in a tone of voice that shows emotion. SOUND LINK emoSyn sounds joyful
Computers learning to decipher singing
Yamaha's Vocaloid singing synthesizer sounds far more natural than most spoken voice synthesis, and in some styles of music you might find it hard to tell it from a real singer. That's because Yamaha records not just the phonemes – sung as nonsense phrases by a professional singer in multiple pitch ranges – that the software splices back together (converted into the frequency domain using a Fast Fourier Transform), but also things like vibrato, pitch bend and attack that give the voice emotion and expression. When musicians input the MIDI track and lyrics they want the Vocaloid voice to sing – and there are several voices to choose from, ranging from opera to pop – they can also draw in the expressiveness they want on screen. Vocaloid is hugely popular in Japanese pop, but you've also heard it in Supercell and Mike Oldfield recordings. SOUND LINK
Computers learning to listen with human brain techniques
Voice recognition doesn't listen to sound and speech the way people do. That's good for identifying the words someone is saying, but humans are far better at tuning out background noise and other conversations (think of talking to someone at a noisy party), using cues like pitch, frequency, intensity, onset, spatial location and duration. Audience's earSmart processor for phones uses the same principles as the human brain to do noise suppression and echo cancellation, down to recording from two microphones at the same time. It's in the Nexus One, the HTC Titan and Vivid, the Samsung Galaxy S II and the Sony Tablet S (and every other current phone from AT&T, which insists its partners use Audience).
The future of talking to computers
We asked the CEO of Nuance, Paul Ricci, what’s next for voice recognition. Understanding what you want and what you mean, he told us. “The problem of recognizing the words, the speech recognition problem, is transforming into a natural language recognition problem – the problem of understanding the words. It’s a speech recognition problem to know I’m saying ‘make a restaurant reservation’ and a natural language processing problem to understand the action I want to take.” And what comes after Siri? “We want to take advantage of lots of information. We can use your location. Another piece of information might be the history of restaurants you like; or things you’ve asked about before, actions you’ve taken, information from your social graph, from your calendar. We’re seeing early implementations like Siri but this is going to develop very fast. This is going to become standard in all smartphones.”