LAS VEGAS — Machine-generated speech, created using regular personal computers and free software, can fool voice authentication, two researchers showed at the DEF CON 26 hacking conference here Friday (Aug. 10).
Faking someone else's voice using text-to-speech (TTS) programs once required hundreds of hours of audio samples of the targeted individual's voice, as well as massive amounts of computing power, Salesforce researchers John Seymour and Azeem Aqil said.
But recent advances in open-source text-to-speech programs, and the researchers' own methods, make it possible for anyone with free technology, a few hours of audio samples and a lot of free time to convincingly fake the voice of a specific individual speaking a preset passphrase.
"Speaker recognition and speaker authentication are two different things," Seymour said. "Speech authentication can be broken if the attacker has speech data of the target and knows the authentication prompt."
Aqil and Seymour were inspired by a scene in the 1992 hacker movie "Sneakers," in which Robert Redford's character gets past a voice-authenticated lock by playing back a tape recording of an authorized user speaking the passphrase: "My voice is my passport. Verify me."
Clients of investment bank Charles Schwab use a very similar phrase, Aqil and Seymour said, to log into their accounts over the phone: "My voice is my password."
Microsoft is currently beta-testing a voice-authorization feature. Even Apple and Google are using voice recognition, although Seymour noted that neither company claims that the feature should be used for serious authentication.
Specific spoken passphrases are used for voice authentication because machine transcription of free-form human speech is still incredibly difficult. When you speak to Amazon Alexa, Apple Siri, Google Assistant or Microsoft Cortana, only the invocation phrase, such as "Hey, Siri," is actually processed on the device.
Everything you say after that is recorded and uploaded to cloud servers, where your speech is played back, transcribed into text and read by back-end services. Those services create a response and then send back machine-generated speech, or an instruction to play a certain piece of music, or whatever else you may have requested.
By restricting the verification passphrase to a few specific words, a machine performing voice-based authentication doesn't have to send the audio clip up to cloud services or perform massive data-crunching on the premises. It only has to compare the waveform of the newly recorded clip to audio clips that you'd recorded earlier.
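To make that comparison concrete, here is a minimal sketch in Python. It is not how any vendor's product works: it assumes each clip has already been reduced to a short list of numbers (a hypothetical "feature vector"), and the function names and the 0.9 threshold are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def verify(enrolled, attempt, threshold=0.9):
    """Accept the attempt if it is close enough to any enrolled clip.

    The threshold is an arbitrary illustrative cutoff, not a real product setting.
    """
    best = max(cosine_similarity(clip, attempt) for clip in enrolled)
    return best >= threshold

# Hypothetical feature vectors standing in for recorded passphrase clips.
enrolled_clips = [[0.9, 0.1, 0.4], [1.0, 0.2, 0.5]]
genuine_attempt = [0.95, 0.15, 0.45]
unrelated_attempt = [0.1, 0.9, 0.0]

print(verify(enrolled_clips, genuine_attempt))
print(verify(enrolled_clips, unrelated_attempt))
```

A check like this has no way to tell who or what produced the clip. Anything that lands close enough to the enrolled recordings is accepted.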
Unfortunately, that creates a huge advantage for an attacker. All he or she has to do is create an audio clip of what sounds very much like you saying the exact words in the passphrase.
In "Sneakers," the hackers tricked the targeted individual into speaking the exact phrase and secretly recorded him. Aqil and Seymour sought a different approach: to train a machine to sound enough like the targeted person that its machine-generated passphrase could fool a voice authenticator.
Aqil and Seymour did just that by using an online service called Lyrebird, which lets clients create "voice avatars" of themselves by recording about 30 preset phrases. Seymour signed up for a free trial, and the results from the service were good enough to fool Microsoft's voice authorization beta.
Here's a clip of Seymour speaking the passphrase, "My voice is stronger than passwords," followed by the Lyrebird version.
That still sounds pretty robotic, though. To get really good voice fakes, you need a lot of samples and a lot of computing time — or do you?
Getting enough voice samples doesn't sound like it would be hard to do if the target is someone whose voice recordings are widely available. Last year, BuzzFeed created a famous clip that showed Barack Obama "speaking" words that were actually spoken by actor/director Jordan Peele.
That was impersonation by a human rather than by a machine, but Seymour and Aqil said Obama's speaking voice has been recorded so many times that you could theoretically get enough material to train a machine to learn and replicate his voice.
But there are some catches. You'd have to narrow the Obama recordings down to those with the best, clearest audio. You'd need to break the longer samples into chunks of 10 seconds or less, because that's what machine-learning software can digest most easily.
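The chunking step is simple to sketch. The Python below assumes the recording is already loaded as a plain list of samples and cuts it at fixed 10-second boundaries. The function name and the 16 kHz sample rate are illustrative assumptions, and a real pipeline would more likely cut at pauses so that words are not split in half.

```python
def split_into_chunks(samples, sample_rate, max_seconds=10.0):
    """Split audio samples into consecutive chunks of at most max_seconds each."""
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len <= 0:
        raise ValueError("sample_rate * max_seconds must cover at least one sample")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

# 25 seconds of silence at an assumed 16 kHz sample rate.
recording = [0.0] * (25 * 16000)
chunks = split_into_chunks(recording, sample_rate=16000)
chunk_seconds = [len(chunk) / 16000 for chunk in chunks]
print(chunk_seconds)
```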
Then you'd need to transcribe all the Obama speech samples so that the machine could compare them to the audio samples during training. You'd probably have to transcribe the speech samples by hand, which could take days or weeks, unless you had access to specialized software on cloud-based servers. Then you'd feed both text and audio into the machine to begin the training, which, by itself, could take weeks.
This all clearly takes too long, so Aqil and Seymour found some workarounds. They increased the number of training samples by slowing down and speeding up the existing audio samples by about 20 percent in each direction and feeding the altered copies back in. Even though the machine had already reviewed the original samples, the pitch variation helped in the training.
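One plausible way to produce such variants is plain resampling, sketched below in Python. The researchers' actual tooling is not described here, so this is an assumption: the function names are made up, and the method shown changes speed and pitch together, the way playing a tape faster or slower would.

```python
def change_speed(samples, factor):
    """Resample by linear interpolation.

    factor > 1 speeds the clip up (fewer samples, higher pitch);
    factor < 1 slows it down (more samples, lower pitch).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    out_len = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(out_len):
        pos = i * factor          # fractional position in the original clip
        left = int(pos)
        if left >= len(samples) - 1:
            out.append(samples[-1])
            continue
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out

def augment(samples, amount=0.2):
    """Return slowed, original and sped-up copies of one clip."""
    return [
        change_speed(samples, 1 - amount),
        list(samples),
        change_speed(samples, 1 + amount),
    ]

clip = [float(i) for i in range(1000)]
slow, original, fast = augment(clip)
print(len(slow), len(original), len(fast))
```

Each source clip becomes three training clips, which triples the training set without any new recording.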
They decided not to transcribe any audio. Instead, Seymour recorded himself reading a lot of prepared text.
Aqil and Seymour also found that they could start the machine training on open-source speech libraries, which featured thousands of hours of male and female voices. Once the machine got the basics down, they could switch the machine to the targeted person's voice. They called this "transfer learning."
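The idea can be shown with a toy model. The Python sketch below is nothing like Tacotron: it fits a straight line instead of a speech network, and every number in it is invented. It only illustrates the pattern of training at length on plentiful generic data and then briefly on a handful of target samples, compared with giving the same brief training to a model that starts from nothing.

```python
def mse(params, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, steps, lr=0.1):
    """Full-batch gradient descent on the mean squared error."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

# Plentiful "generic" data, and a few samples from a slightly different target.
generic_data = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(100)]
target_data = [(x, 2.2 * x + 0.9) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]

pretrained = train((0.0, 0.0), generic_data, steps=2000)   # long generic training
fine_tuned = train(pretrained, target_data, steps=20)      # short switch to the target
from_scratch = train((0.0, 0.0), target_data, steps=20)    # same short budget, no head start

fine_tuned_loss = mse(fine_tuned, target_data)
scratch_loss = mse(from_scratch, target_data)
print(fine_tuned_loss < scratch_loss)
```

With the same short training budget on the target, the model that started from the generic data ends up much closer than the one that started cold.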
The researchers recommended the open-source text-to-speech packages Tacotron and WaveNet, although they preferred the former. (There are very impressive examples of text-generated speech at Google's Tacotron page, but of course, that used a lot of computing power.) For samples of human speech to train the software, they used the open-source Blizzard and LJ Speech repositories.
The end results were convincing, if not perfect. The first example here is Tacotron generating the sentence "I am going to make you an offer you cannot refuse" after having been trained on the Blizzard speech library, without transfer learning.
The second example features the same phrase, but after Seymour's voice had been added to the training in the transfer-learning method. It still sounds a bit robotic, and isn't as good as Google's own Tacotron samples, but should be enough to fool voice authentication.
Seymour said the implications of this went beyond voice authentication. As in the Obama video, faked voices could be used to fake political speeches. They could also be used in phishing attempts and other forms of social engineering, especially those involving phone calls or voicemail.
As the pair showed, such attacks are no longer limited to well-funded internet services or nation-state attackers. Expect to hear many more convincing robot voices in your future.
Aqil and Seymour's presentation slides and audio clips are available on the DEF CON website.