Siri is a breakthrough in speech recognition and control, but we've been talking to computers since the 1950s, with varying degrees of success – and machines have been talking back to us since the 1930s. The story of how we've made computers (and phones and cars, and soon home appliances) listen and talk is as fascinating as the results, with scandal and fraud as well as breakthroughs; military history and specialized – and secret – tools for the intelligence services are as much a part of it as university and commercial labs. Hollywood showed futuristic talking computers decades before they were possible, but the music and film industries have also pushed developments in artificial speech and sound. Here's how we got to Siri – and beyond.
Legends of medieval scholars with heads of bronze that not only spoke but told the future range from Roger Bacon to Faust; Bacon's mechanism appears in an early science fiction story as The Brazen Android. Christian Kratzenstein built the first real speaking machine in 1779, using tubes and organ pipes to make artificial vocal cords. It only produced vowels, but in 1791 Wolfgang von Kempelen (maker of the chess-playing Mechanical Turk – with a human player inside) created one with two resonating tubes; manipulating one with each hand, he made it 'speak' whole words. And in 1845 Joseph Faber toured his Euphonia speaking machine before audiences that included the father of Alexander Graham Bell.
The first electronic speech synthesizer, the VODER (Voice Operating DEmonstratoR) built by Homer Dudley at Bell Labs, was difficult to operate and wasn't that easy to understand when it said "Good evening, radio audience" at the 1939 New York World's Fair – but it was good enough for the New York Times to declare "My God, it talks." In 1936 the UK telephone service had introduced the Speaking Clock, which joined recorded words and phrases into sentences; they were stored as optical recordings on glass discs and read back by light, much as the 1950 Pattern Playback machine from Haskins Labs played back images of speech stored as spectrograms. SOUND LINKS: The VODER speaks · The Pattern Playback speaks
Dudley also developed the Vocoder, a speech compression system originally designed to save bandwidth on telephone networks (think of it as the physical equivalent of the Skype codec: it compresses speech by running it through a bank of filters, transmitting only the signal level in each band, and playing that back through another set of filters, so you can send a much smaller signal). It was never used in telephones but became popular with musicians like Wendy Carlos, who used it on the soundtrack of A Clockwork Orange. Many effects you think are a Vocoder – like the talking train in Dumbo or Sparky's Magic Piano – actually use the Sonovox. This has two discs you press against your throat; mouthing the words makes the discs vibrate, producing a similar effect.
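The channel-vocoder idea can be sketched in a few lines of code. This is a toy illustration, not how Dudley's analog hardware worked: the analyser reduces each frame of audio to a handful of per-band energies (the small signal you would transmit), and the synthesiser shapes a buzzy pulse-train carrier so its bands carry those same energies. The frame size, band count and 200 Hz carrier are all illustrative choices.

```python
import cmath
import math

N = 160        # samples per frame (20 ms at 8 kHz) -- illustrative
BANDS = 8      # number of filter-bank channels -- illustrative
BINS_PER_BAND = (N // 2) // BANDS

def dft(frame):
    """Naive discrete Fourier transform (fine for a 160-sample toy)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)) for k in range(n)]

def analyse(frame):
    """Analyser: reduce 160 samples to 8 band energies -- the small
    signal a channel vocoder actually transmits."""
    spectrum = dft(frame)
    return [sum(abs(spectrum[b * BINS_PER_BAND + i])
                for i in range(BINS_PER_BAND)) for b in range(BANDS)]

def synthesise(energies):
    """Synthesiser: scale the bands of a buzzy pulse-train carrier so
    they carry the transmitted energies, then invert the DFT."""
    carrier = [1.0 if t % 40 == 0 else 0.0 for t in range(N)]  # 200 Hz buzz
    spec = dft(carrier)
    carrier_energies = analyse(carrier)
    out = [0j] * N
    for b in range(BANDS):
        gain = energies[b] / carrier_energies[b] if carrier_energies[b] else 0.0
        for i in range(BINS_PER_BAND):
            k = b * BINS_PER_BAND + i
            out[k] = spec[k] * gain
            if k:  # mirror the bin so the inverse transform is real-valued
                out[N - k] = out[k].conjugate()
    return [(sum(X * cmath.exp(2j * math.pi * k * t / N)
                 for k, X in enumerate(out)) / N).real for t in range(N)]
```

The compression is the point: 160 samples in, 8 numbers transmitted, 160 samples back out – intelligible-ish speech, not a faithful waveform, which is exactly the trade Dudley was making.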
Bell Labs was working on recognizing speech as well as just transmitting it, and in 1952 researchers created Audrey, the Automatic Digit Recognizer, which could recognize individual spoken numbers with up to 99% accuracy. That was with a man speaking, pausing distinctly after each number and using no words other than one through nine (and 'oh' for zero). Audrey had to be tuned for each speaker, who had to record enough samples for the system to store twenty typical patterns for each number in its analog memory. What Audrey was listening for were the 'formants' – the two resonant peaks, F1 and F2, in the frequency spectrum of speech that are enough for us to tell vowels apart.
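The formant trick is simple enough to sketch in code. This is a toy digital illustration of the idea, not Audrey's analog circuitry: stand in for a vowel with its two strongest frequency components, find the two biggest peaks in the spectrum, and match them against a reference table. The formant values below are rough textbook averages, not the patterns Audrey stored.

```python
import cmath
import math

RATE = 8000  # Hz, telephone-grade sampling -- illustrative

# Approximate (F1, F2) pairs in Hz for three vowels -- textbook-style
# averages chosen for illustration.
VOWELS = {"ee": (270, 2290), "ah": (730, 1090), "oo": (300, 870)}

def make_vowel(f1, f2, n=800):
    """Crude stand-in for a spoken vowel: two sinusoids at F1 and F2."""
    return [math.sin(2 * math.pi * f1 * t / RATE) +
            0.8 * math.sin(2 * math.pi * f2 * t / RATE) for t in range(n)]

def top_two_peaks(signal):
    """Naive DFT; return the two strongest frequencies in Hz, ascending."""
    n = len(signal)
    mags = sorted(((abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                            for t, x in enumerate(signal))), k * RATE / n)
                   for k in range(1, n // 2)), reverse=True)
    return tuple(sorted((mags[0][1], mags[1][1])))

def classify(signal):
    """Pick the vowel whose table formants are closest to the peaks."""
    f1, f2 = top_two_peaks(signal)
    return min(VOWELS,
               key=lambda v: abs(VOWELS[v][0] - f1) + abs(VOWELS[v][1] - f2))
```

Two numbers per sound, matched against a small table of stored patterns per speaker – that is essentially the budget Audrey was working within, implemented in vacuum tubes and relays rather than Python.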
By the 1962 World's Fair, IBM was showing off a smaller and more powerful system called Shoebox (named for the size and shape of its wooden case) that worked as a voice-controlled calculator. Shoebox recognized all ten digits plus six commands, including plus, minus and total, so you could speak math problems into the microphone and get the results printed out on the manual calculator it was linked to. Each number it recognized was displayed in lights, so you could check it had heard you correctly.
Speech in science fiction switched from repetitive monotones ("Danger, Will Robinson!") to natural and realistic voices: the computer in Star Trek (voiced by Majel Barrett), C-3PO in Star Wars – and of course, HAL 9000 in 2001. In the early 1960s Arthur C. Clarke was visiting Bell Labs when John Kelly and Carol Lochbaum programmed an IBM 704 – the first mass-produced computer with floating-point arithmetic and magnetic core memory rather than Williams-tube storage – to sing Daisy Bell (better known as A Bicycle Built For Two); the recording was later released on the album Music from Mathematics. Impressed, Clarke added it to his screenplay, and Stanley Kubrick used an Eltro Mark II audio processor to change the pitch and speed of actor Douglas Rain's voice as he sang Daisy, Daisy. SOUND LINK: IBM sings Daisy Bell
Clarke was at Bell Labs to visit executive director John Pierce, who decided in 1969 that general-purpose speech recognition was as likely as "curing cancer or going to the moon" and stopped Bell Labs' work on it. DARPA funded the Speech Understanding Research program in 1971. The most successful project was Carnegie Mellon's Harpy, which was 95% accurate at recognizing continuous speech with a vocabulary of 1,011 words. However, it needed training, it took 80 times longer to recognize a sentence than it took you to speak it, and it only worked with words in a specific order. IBM's voice-activated typewriter – connected to an IBM 370 computer – also had a thousand-word vocabulary, but it took an hour to process a sentence. Still, the 1970s saw the first commercial speech recognition company: Threshold Technology, which sold its VIP-100 system to FedEx for sorting packages on a conveyor belt.