Google is developing a new AI speech-to-speech translation technology that could imitate your voice in real time. Its name is Translatotron, and it may be key to achieving, within our lifetimes, the seamless universal translation we see on Star Trek.
The new tech skips the entire speech-to-text step of current translation technologies. Right now, you speak into your microphone, your speech is recognized and converted to text, and your phone shows the translated text on the screen, with the option of reading it out loud in a generic synthetic voice.
The new Translatotron aims to do two things. First, eliminate the speech-to-text step, and with it the need for text-to-speech synthesis, by going directly to a speech-to-speech model. Second, get rid of the generic voice and replace it with your own. While not perfect, the examples on the Google Research GitHub page are pretty good (check out the “Predictions with voice transfer” column in the second “Conversational Spanish-to-English” section).
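To make the difference concrete, here is a toy Python sketch of the two approaches. It is only an illustration: every name in it (`Audio`, `cascade_translate`, `direct_translate`, the tiny dictionary) is made up for this example, and word lists stand in for real audio. The actual system is a neural network working on spectrograms, not a dictionary lookup.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for a recording: who is speaking, in which
# language, and a word list in place of the actual waveform.
@dataclass
class Audio:
    speaker: str
    language: str
    words: List[str]

# Toy Spanish-to-English dictionary, purely illustrative.
TOY_DICTIONARY = {"hola": "hello", "mundo": "world"}

def recognize_speech(audio: Audio) -> str:
    """Speech-to-text: the speaker's voice is lost at this point."""
    return " ".join(audio.words)

def translate_text(text: str) -> str:
    """Text-to-text translation using the toy dictionary."""
    return " ".join(TOY_DICTIONARY.get(word, word) for word in text.split())

def synthesize_speech(text: str, language: str) -> Audio:
    """Text-to-speech: always answers in a generic synthetic voice."""
    return Audio(speaker="generic", language=language, words=text.split())

def cascade_translate(audio: Audio, target_language: str) -> Audio:
    """Current approach: speech -> text -> translated text -> speech."""
    text = recognize_speech(audio)
    translated = translate_text(text)
    return synthesize_speech(translated, target_language)

def direct_translate(audio: Audio, target_language: str) -> Audio:
    """Speech-to-speech idea: one step, no text in between, same voice."""
    translated_words = [TOY_DICTIONARY.get(word, word) for word in audio.words]
    return Audio(speaker=audio.speaker, language=target_language,
                 words=translated_words)

if __name__ == "__main__":
    recording = Audio(speaker="you", language="es", words=["hola", "mundo"])
    print(cascade_translate(recording, "en"))
    print(direct_translate(recording, "en"))
```

In this sketch both paths produce the same words, but only the direct one keeps the original speaker attached to the output, which is the part of the idea that voice transfer is meant to capture.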
The paper, published on arXiv by Google research scientists Ye Jia, Ron Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu, describes how the team used a neural network to convert spectrograms of the original speech directly into spectrograms of speech in another language, reproducing the original voice.
The researchers acknowledge that the result is not perfect yet. They are getting there, though: the goal of this first study was to demonstrate that the model is feasible. “The proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.”
The technology opens a path to a Star Trek-like future in which you speak and, automagically, people hear you actually speaking in their own language. Perhaps then humans will be able to understand each other and get past one of the barriers that separate societies from one another. The end of Babel is nigh!