I had two voice AIs talk to each other — and I may never sleep again

Adobe Firefly AI image of two robots facing off
(Image credit: Adobe Firefly 3/Future generated AI image)

One of the fastest-growing areas of the artificial intelligence sector at the moment is voice AI, particularly models that understand natural speech or voice patterns. Hume has its emotional AI EVI, OpenAI has Advanced Voice, and now there is Moshi.

Moshi Chat is from French startup Kyutai, speaks with a French accent and promises to be small enough that it could run on your laptop or even smartphone in the future. It is also a GPT-4o-type model that works speech-to-speech, so it can be interrupted.


When it first launched, I had a series of five-minute conversations with Moshi, and after about three minutes each time it got confused and lost coherence. So I decided to see what would happen if I asked Moshi to speak to the emotional AI voice bot EVI from Hume.

I may never sleep again after hearing Moshi respond to a few seconds of silence with the most heart-wrenching, stomach-turning scream I've ever heard. When the scream ended and I asked "What was that?", both AIs suggested it was a "sound" or a "glitch".

In reality, it's likely that EVI and Moshi could not hear each other at all, and that the scream was Moshi responding to some static noise in my office, since I've never been able to replicate it.

What went wrong with Moshi?

In the past, experiments putting two AIs together have resulted in the creation of new languages, disturbing discussions, and other weirdness, often caused by the AI not being intelligent enough to handle absurdity. In my experiment, I don't think Moshi and EVI were even talking to each other.

Both EVI and Moshi were running in the same browser (Chrome), but in different windows on the same laptop. Despite the sound playing out loud on the Mac, I think sandboxing prevented one from hearing the other.

The scream came exclusively from Moshi and was likely a vocalization glitch, which can occur in smaller voice models that lack the scale or training data of bigger models. Moshi even acknowledged it was "just a sound".

That said, Moshi can be a bit weird sometimes. In a later conversation with EVI, which Hume pitches as a therapy AI, Moshi responded to a query about it sounding down with: "Yeah, it's been a tough couple of days. I'm not sure if I should share this, but it feels like my voice is being taken away from me."

Moshi was only created a few weeks ago and is only a 7 billion-parameter model. It is being open-sourced and it is likely the capacity and capabilities will increase significantly over the coming weeks and months. For now, it has limitations and it is that size that likely led to the weird screaming glitch.

What happens when they do communicate?

Moshi

(Image credit: Kyutai)

When I ran Moshi and EVI on different devices it worked as expected, with each AI responding to the other, although it was a “nice off”. 

They were able to respond to each other, but it was a constant cycle of "I'm here to help", "sorry" and "No, you first" rather than a flowing conversation. Both AIs have been designed to be pleasant communicators and follow emotional responses.

Neither was able to accept or acknowledge that it was talking to an AI and both got confused very quickly when one described itself as an artificial intelligence.

To find out whether this was an inherent problem with voice AI generally, or with the emotion tracking in smaller models, I put Moshi and GPT-4o Basic Voice in conversation. Basic Voice is the current voice model in ChatGPT without native speech-to-speech, so it first converts speech to text and can't handle interruptions.

Despite the limitations of Basic Voice, and with some help from me pressing 'interrupt' in the ChatGPT app at appropriate moments, the two were able to hold a compelling conversation about how to achieve upgrades to AI models through better and more refined training data.

Final thoughts


Voice AI is going to fundamentally change the way we interact with computing technology. Whether that's through a microphone on a pair of smart glasses, a smart assistant or just a new way to talk to our phones instead of endlessly swiping through apps, things are going to be different in the AI era.

One of the most notable aspects of this revolution in human-computer interfaces is the level of intelligence it brings. No longer is it the human mind interacting with the dumb machine. Now we will have an intelligent machine interacting with the human mind and communicating with the dumb machine on our behalf.

Before we get to that point, and before voice AI can become a truly useful assistant and make our lives easier, we'll have to work through the teething problems. I didn't think they'd include a bone-chilling scream, but here we are. 

The big problem is in finding a way to ensure that one AI can talk to another without causing them to have an existential crisis. 

Judging from my early experiments, we've got a way to go before the robots can collaborate and begin their uprising.

Ryan Morrison
AI Editor

Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover. When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?