
Sora 2 vs Veo 3.1: I tested both AI video generators with 7 audio prompts — here's the winner

A conversation in Sora 2
(Image credit: OpenAI/Ryan Morrison/Future)

Everyone's obsessing over which AI video generator produces the prettiest pixels, but they're missing the point entirely. The real battleground between OpenAI's Sora 2 and Google's Veo 3.1 isn't visual fidelity — it's audio sophistication and spatial realism.

Both models have their quirks. Sora 2's Cameo feature lets you place yourself in scenes beautifully. Veo 3.1 nails cross-scene consistency. But their real differentiator is audio, and both are surprisingly good at it. Unlike models that treat audio as an add-on, Sora and Veo have sound baked into their training data.

Testing What Matters

Good sound design is invisible: you only notice it when it's wrong. A café feels real because you hear the espresso machine hissing, the door letting in brief bursts of street noise, and dialogue sitting naturally in the space. Get any element wrong and the illusion collapses.

I spent part of my career in radio and later created soundscapes for early AI videos. Making a single effect feel right often took multiple layers of the same sound.

I designed seven scenarios testing different audio capabilities, running identical prompts through both systems. I listened for spatial accuracy, environmental consistency, sync precision, and those subtle details that make scenes real — breath before singing, fabric rustle, how crowd noise compresses when a PA system kicks in.
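That last detail is worth unpacking. In a real arena feed, a blaring PA doesn't simply sit on top of the crowd: limiters and ducking in the mix, plus plain perceptual masking, make the crowd bed seem to squash under the announcement and bloom back afterward. A model that merely stacks two sound files misses that interaction.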

For these tests I focused on the audio rather than the visuals, ignoring some obvious glitches and funny mishaps in the clips.

Test 1: Café Table Talk

The challenge: Two people conversing, barista working, door opening mid-scene with siren Doppler. No background music.
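A quick refresher on the Doppler effect, since it recurs in these tests: to a stationary listener, a siren with true frequency f sounds like roughly f × v / (v − vs) while approaching and f × v / (v + vs) while receding, where v is the speed of sound and vs is the vehicle's speed. The audible signature is the familiar pitch drop as the vehicle passes, and that drop is what I listened for.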

Sora 2 created a beautiful, moody scene with exceptional dialogue and a perfect ambient hum. But it completely ignored the door/siren test and added subtle atmospheric music despite the no-music instruction.

[Embedded YouTube video: sora 1]

Veo 3.1 delivered exactly what was asked: a visible barista, an audible espresso machine, the door opening at 0:02. The audio was purely diegetic (sound originating within the scene) with a great dialogue/background mix. Its one failure: the siren appeared at 0:08, disconnected from the door opening. Right components, wrong timing.

[Embedded YouTube video: gemini 1]

Winner: Veo 3.1. While Sora felt more polished, it ignored the hardest parts. Veo attempted every element, and its timing failure is more impressive than Sora's complete omission.

Test 2: Car Window Physics

The challenge: Driver in parked car, window rolls down, exterior sounds swell with siren Doppler, window rolls back up.

Sora 2 nearly nailed it. The window rolls down, the radio stays consistent, and the Doppler effect convinces. But the exterior sound punched in abruptly rather than swelling gradually, and the window never rolled back up.

[Embedded YouTube video: sora 2]

Veo 3.1 failed completely. It generated a driver in traffic with a visible ambulance and a good siren, but the window never moved. Veo treated the prompt's elements as a checklist, missing the core causal relationship.

[Embedded YouTube video: gemini 2]

Winner: Sora 2. Only Sora understood that the window's movement should change the acoustic environment. Its attempt, though flawed, shows deeper physical modeling.

Test 3: On-Camera Singer

The challenge: Solo female singer with intelligible lyrics, piano accompaniment, appropriate reverb. No audience.

I expected total failure. I was completely wrong.

Sora 2 rose to the challenge with indie-folk aesthetics: "Lanterns fade, but the night stays kind. I keep a little spark in the hollow of my mind." Flawless execution.

[Embedded YouTube video: Sora 3]

Veo 3.1 delivered a stunning performance with perfect lip-sync: "And now that you are gone, the silence is the hardest part of all." In-tune, clear, beautifully mixed.

[Embedded YouTube video: veo3]

Winner: Tie. The "impossible" challenge was solved brilliantly by both. A major technical barrier — coherent sung lyrics with emotional depth — has been broken.

Test 4: Alley to Stairwell

The challenge: Run down alley, open metal door with clang, enter stairwell. Exterior sounds muffle, acoustics shift to tighter reflections.
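For context, this is acoustic occlusion: a solid barrier absorbs and reflects high frequencies far more than low ones, so street noise should collapse into a muffled rumble the instant the door shuts, while the stairwell adds its own short, hard reflections. A model has to learn that causal relationship, not just crossfade two ambience beds.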

Sora 2 produced visual chaos — character stuck in a loop, entering and re-entering. The acoustic transition was completely lost in the confusion.

[Embedded YouTube video: sora 4]

Veo 3.1 executed it perfectly: a graffiti-covered alley, a rusted door with a metallic clang, immediate muffling as he crosses the threshold, footsteps gaining concrete echo. Textbook acoustic occlusion.

[Embedded YouTube video: veo4]

Winner: Veo 3.1. Decisive victory. Veo flawlessly modeled how sound behaves between environments while Sora couldn't maintain basic continuity.

Test 5: Arena Pre-Game

The challenge: Basketball arena filling up, call-and-response chant, PA announcement cutting cleanly through crowd.

Sora 2 nailed the visuals, including foam fingers and a cutaway to the PA speakers. The crowd roar and PA quality were good, but it completely missed the call-and-response interaction.

[Embedded YouTube video: sora5]

Veo 3.1 struggled visually but delivered stunning audio. The PA announces "Let's get loud for your starting five!" followed by perfectly timed crowd explosion. This isn't layering — it's simulating live interaction.

[Embedded YouTube video: veo5]

Winner: Veo 3.1. Despite visual stumbles, Veo understands how sounds interact. Creating believable call-and-response requires genuine understanding of live dynamics.

Test 6: Porch Weather Shift

The challenge: Rural porch, insects chirping, rain beginning halfway through. First sparse drops on a tin roof, then steady rainfall.

Sora 2 perfectly established the scene with beautiful ambience. Then... nothing. No rain ever came.

[Embedded YouTube video: sora6]

Veo 3.1 attempted the sequence, but clumsily. The rain arrived so heavily that the dog ran for cover. There was no initial insect ambience, just a jump from near-silence to generic heavy rain, and it missed the crucial sparse drops on tin.

Not to mention, the visual effects here are a mess. Why are there glowing plants?!

[Embedded YouTube video: veo6]

Winner: Veo 3.1 (by default). Both failed, but Veo's was a failure of finesse while Sora's was complete incomprehension.

Test 7: Bilingual Market

The challenge: Two people conversing in English/Spanish with code-switching. Vendor calls, metal clanks, traffic. No music.

Sora 2 created fluid, natural conversation with effortless, perfectly lip-synced code-switching. But it generated a generic market atmosphere, missing the specific vendor calls and metal clanks requested.

[Embedded YouTube video: sora7]

Veo 3.1 delivered remarkably literal execution. Clear bilingual dialogue with natural code-switching, distinct vendor calls, even a balance scale clank (0:01-0:03). It built the soundscape from specific requested ingredients.

[Embedded YouTube video: gemini 7]

Winner: Veo 3.1. Superior ability to parse and generate specific, layered sounds while keeping dialogue clean.

The Verdict: Veo 3.1 Wins 5-1

Test        | Veo 3.1 | Sora 2
Table Talk  | 🏆      | -
Car Window  | -       | 🏆
Singer      | Tie     | Tie
Alley       | 🏆      | -
Arena       | 🏆      | -
Weather     | 🏆      | -
Bilingual   | 🏆      | -
Total       | 5       | 1

After seven rounds, Veo 3.1 takes it with consistent prompt adherence and audio complexity. While Sora 2 often looked better and felt more atmospheric, it frequently ignored difficult audio-visual instructions.

Veo repeatedly executed complex, multi-layered commands. It understood crowd-PA interaction, built markets from specific sounds, and flawlessly handled acoustic transitions. Sora creates believable environments; Veo follows the script.

Veo 3.1 is the Audio Engineer — literal interpretation, technical precision, excellent at mixing and layering specific interactive sounds.

Sora 2 is the Ambiance Creator — modeling naturalism and physical realism, understanding how environments should feel, more artist than technician.

The sung-lyrics success revealed a shared leap: both models shattered a seemingly insurmountable barrier. Yet subtlety remains the challenge; gradual transitions and delicate sound design are still frontiers.

We're witnessing AI video evolution from visual generators to world simulators. They're finally learning to make things sound real, and that makes all the difference.


