Google's Latest AI Trick Is Picking Voices Out of Crowds
As smart speakers and voice-based AI assistants rise in popularity, these tools need to get better at telling when you're actually asking for something and when, say, someone on your TV uses a trigger word.
Fortunately, it looks like Google's got a solution.
According to a new research paper, which has video evidence to back up its claims, a team of Google researchers has built a deep learning system that can identify and single out individual voices. And just as you would at a big table or out on the town, it looks at faces to figure out who is talking.
How does it work? The system was built to identify a person speaking out loud by matching the face to the voice. To amp up the difficulty, Google piped in audio from virtual crowds of people, teaching the AI to pick out the voices it heard from the deluge of noise around them.
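To make the training idea concrete, here is a rough sketch of how a clean voice clip might be blended with crowd noise to produce practice material for a model like this. It is not Google's code: the function names (`mix_at_snr`, `make_training_pair`), the toy sine-wave "voice," the random "crowd," and the 0 dB loudness ratio are all assumptions made for illustration, and the real system also uses video of the speaker's face, which this sketch leaves out.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a list of audio samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so speech is snr_db louder than the noise."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise: nothing to mix in
    # Gain that sets the noise level relative to the speech level.
    gain = rms(speech) / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def make_training_pair(speech, crowd, snr_db=0.0):
    """Hypothetical helper: return (noisy input, clean target)."""
    return mix_at_snr(speech, crowd, snr_db), list(speech)


# Toy stand-ins: a 220 Hz tone as the "voice," random samples as the "crowd."
rate = 8000
speech = [math.sin(2 * math.pi * 220 * t / rate) for t in range(rate)]
rng = random.Random(0)
crowd = [rng.uniform(-1, 1) for _ in range(rate)]

noisy, clean = make_training_pair(speech, crowd, snr_db=0.0)
```

The point of producing pairs like this is that the clean voice is known in advance, so a model can be scored on how closely it recovers that voice from the noisy mix.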
A demonstration video showing off this technology focuses on comedians Jon Dore and Rory Scovel, who are talking at the same time. In the clip, pink and blue boxes appear over each of their heads, and the sound waves at the bottom of the screen take on those same pink and blue hues to show how their faces have been matched to their voices.
Then, the slider bar at the bottom of the screen moves horizontally between labels marked All, Jon, and Rory. The sound changes along with it, letting you hear both men at once, and then only one of them while the other is muted.
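Once the two voices have been pulled apart, a slider like the one in the demo could be little more than a volume blend between the separated tracks. The sketch below is a guess at that behavior, not a description of Google's demo code: the names `slider_gains` and `slider_mix`, the 0-to-2 slider scale (0 for All, 1 for the first speaker, 2 for the second), and the straight-line fades are all assumptions.

```python
def slider_gains(position):
    """Map a slider position in [0, 2] to (first_gain, second_gain).

    0.0 = both speakers, 1.0 = first speaker only, 2.0 = second speaker only.
    """
    if not 0.0 <= position <= 2.0:
        raise ValueError("position must be between 0.0 and 2.0")
    if position <= 1.0:
        # Moving from "All" toward the first speaker: fade the second one out.
        return 1.0, 1.0 - position
    # Moving from the first speaker to the second: crossfade between them.
    return 2.0 - position, position - 1.0


def slider_mix(first_track, second_track, position):
    """Blend two already-separated voice tracks according to the slider."""
    first_gain, second_gain = slider_gains(position)
    return [first_gain * a + second_gain * b
            for a, b in zip(first_track, second_track)]


# Usage: with the slider on the first speaker, only that track is heard.
only_first = slider_mix([1.0, 2.0], [10.0, 20.0], 1.0)
```

The hard part is the separation itself; the blend only works because the system has already produced one track per speaker.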
Where and how Google implements this in its product line remains to be seen, but its Hangouts chat client and YouTube videos seem like ideal places to test it out. Further, if you added a camera to a Google Home speaker, the device could do a much better job of knowing who's talking and delivering personalized results.