This AI model is learning to speak by watching videos — here's how


The AI model DenseAV is learning the meaning of words and the location of sounds without human input or text simply by watching videos, researchers said.

In a paper, researchers from MIT, Microsoft, Oxford, and Google explained that DenseAV manages to do so using only self-supervision from video.

To learn these patterns, it uses audio-video contrastive learning to associate a particular sound with the observable world. In this mode of learning, the visual side of the model can’t gain any insights from the audio side (and vice versa), which forces the algorithm to recognize objects in a meaningful way.

It learns by comparing pairs of audio and visual signals, determining which data is important and evaluating which signals match and which don’t. Since it’s easier to predict what you are seeing from what you are hearing when you understand language and can recognize sounds, this is how DenseAV can learn without labels.
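For readers who want to see the matching idea in code, here is a minimal sketch of a generic contrastive loss over paired audio and video embeddings. It is an illustration under stated assumptions, not DenseAV's actual objective: the function names, the temperature value and the symmetric InfoNCE-style formulation are all assumptions made for this example.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Hypothetical symmetric contrastive loss.

    audio_emb, video_emb: arrays of shape (batch, dim) where row i of
    each array comes from the same clip. The temperature is an assumed
    value, not one taken from the DenseAV paper.
    """
    # Normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = a @ v.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    # True pairs sit on the diagonal; every other entry is a mismatch.
    loss_audio_to_video = -log_softmax(sim, axis=1)[idx, idx].mean()
    loss_video_to_audio = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_video + loss_video_to_audio)
```

The loss is low when each audio clip is most similar to its own video and high when the pairs are scrambled, which is the signal that lets a model of this kind learn without labels.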

How does it work?


The idea for this process struck MIT PhD student Mark Hamilton while he was watching the movie March of the Penguins. There’s a particular scene where a penguin falls and lets out a groan.

“When you watch it, it’s almost obvious that this groan is standing in for a four-letter word. This was the moment where we thought, maybe we need to use audio and video to learn language,” Hamilton said in an MIT news release.


His aim was to have his model learn a language by predicting what it’s seeing from what it’s hearing and vice-versa. So if you hear someone saying “grab that violin and start playing it” you’re likely going to see a violin or a musician. This game of matching audio to video was repeated across various videos.

Once this was done, the researchers focused on the pixels the model was looking at when it heard a particular sound. Someone saying “cat,” for example, would trigger the algorithm to start looking for cats in the video. Seeing which pixels the algorithm selects means you can discover what it thinks a particular word means.
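The pixel inspection described above can be sketched as a similarity heatmap between one audio feature and a grid of per-pixel visual features. This is a simplified illustration, not the researchers' code: the function names and the use of plain cosine similarity are assumptions.

```python
import numpy as np

def similarity_heatmap(audio_vec, pixel_feats):
    """Cosine similarity between one audio feature and every pixel.

    audio_vec: shape (dim,), the feature for a sound such as a spoken word.
    pixel_feats: shape (height, width, dim), one visual feature per pixel.
    Both inputs are hypothetical stand-ins for real model outputs.
    """
    a = audio_vec / np.linalg.norm(audio_vec)
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    return p @ a  # shape (height, width)

def hottest_pixel(heatmap):
    """Return the (row, col) of the pixel that best matches the sound."""
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)
```

Plotting the heatmap over the video frame would show the region the model associates with the sound, which is how one could read off what it thinks a word refers to.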

But suppose DenseAV hears someone saying “cat” and later hears a cat meowing. In both cases, the AI might identify an image of a cat in a shot. Does that mean the algorithm thinks a cat is the same thing as a cat’s meow?

The researchers explored this by giving DenseAV a “two-sided brain,” and they found that one side naturally focused on language while the other focused on sounds like meowing. So DenseAV did learn the different meanings of the spoken word and the sound without any human intervention.
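One way to picture the “two-sided brain” is to split each feature vector into two halves, score each half separately and see which half carries the match. The sketch below is a toy illustration only: the even split, the function names and the hand-built example vectors are assumptions, not details from the paper.

```python
import numpy as np

def per_head_similarity(audio_vec, visual_vec, n_heads=2):
    """Split both vectors into n_heads equal chunks and score each chunk.

    The total similarity is the sum of the returned per-head scores.
    """
    audio_heads = np.split(audio_vec, n_heads)
    visual_heads = np.split(visual_vec, n_heads)
    return np.array([a @ v for a, v in zip(audio_heads, visual_heads)])

def dominant_head(audio_vec, visual_vec, n_heads=2):
    """Index of the head that contributes most to the match."""
    return int(np.argmax(per_head_similarity(audio_vec, visual_vec, n_heads)))

# Toy vectors, invented for illustration: both sounds match the same
# cat image, but through different halves of the feature vector.
cat_image = np.array([1.0, 1.0, 1.0, 1.0])
spoken_cat = np.array([1.0, 1.0, 0.0, 0.0])  # "language" half active
meow_sound = np.array([0.0, 0.0, 1.0, 1.0])  # "sound" half active
```

In this toy setup both sounds score equally against the cat image overall, yet `dominant_head` reports a different half for each, mirroring the split between words and sounds that the researchers describe.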

Why is this useful?

DenseAV is an algorithm capable of discovering the meaning of language and locations of sounds just by watching unlabeled videos. DenseAV is completely unsupervised and never sees text during its training. Learn more: https://t.co/eG755yC9mI (June 11, 2024)

The massive amount of video content already out there means AI can be trained on things like instructional videos.

“Another exciting application is understanding new languages, like dolphin or whale communication, which don’t have a written form of communication,” Hamilton said.

The next step for the team is to create systems that can learn from video-only or audio-only data, which is helpful in areas where there’s lots of one type of material but less of the other.


Christoph Schwaiger is a journalist who mainly covers technology, science, and current affairs. His stories have appeared in Tom's Guide, New Scientist, Live Science, and other established publications. Always up for joining a good discussion, Christoph enjoys speaking at events or to other journalists and has appeared on LBC and Times Radio among other outlets. He believes in giving back to the community and has served on different consultative councils. He was also a National President for Junior Chamber International (JCI), a global organization founded in the USA. You can follow him on Twitter @cschwaigermt.