A new artificial intelligence model called WALT can take a simple image or text input and convert it into a photorealistic video. Preview clips include dragons breathing fire, asteroids hitting the Earth and horses walking on a beach.
One of the more notable advances made by the Stanford University team behind WALT is the ability to generate consistent 3D motion for an object, and to do so from a natural language prompt.
Creating video from images or text is the next big frontier. It is a complex problem to solve, requiring more than just stitching a sequence of images together: each frame has to follow logically from the previous one to create fluid motion.
What makes WALT stand out?
Companies like Pika Labs, Runway, Meta and StabilityAI all have generative video models with varying degrees of fluidity, coherence and quality. Agrim Gupta, the researcher behind WALT, says it can generate video from text or images, including consistent 3D motion.
Gupta says WALT was trained with both photographs and video clips stored inside the same latent space. This allowed for training across both at the same time, giving the model a deeper understanding of motion from the start.
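Conceptually, training images and videos "inside the same latent space" means a still image can be treated as a one-frame video, so a single encoder compresses both into the same kind of latent grid. The toy sketch below illustrates only that idea with simple patch averaging; it is not WALT's actual encoder, and the `encode` function, patch size and shapes are illustrative assumptions.

```python
import numpy as np

def encode(frames):
    # Toy stand-in for a video encoder: average 8x8 spatial patches.
    # frames has shape (T, H, W, C); a still image is simply T == 1.
    T, H, W, C = frames.shape
    return frames.reshape(T, H // 8, 8, W // 8, 8, C).mean(axis=(2, 4))

image = np.random.rand(1, 128, 128, 3)   # a still image: one frame
video = np.random.rand(16, 128, 128, 3)  # a 16-frame video clip

# Both inputs land in the same latent grid shape per frame,
# so one model can be trained on images and videos at once.
print(encode(image).shape)  # (1, 16, 16, 3)
print(encode(video).shape)  # (16, 16, 16, 3)
```

Because images and clips share one representation, the abundant supervision available for still images can inform the model's understanding of individual video frames, while the clips teach it motion.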
"We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space," the team announced on X on December 11, 2023.
WALT is designed to be scalable and efficient, delivering state-of-the-art results in both image and video generation. It uses a cascade of three models, a base generator plus two upsamplers, which allows for higher resolution and consistent motion.
"While generative modeling has seen tremendous recent advances for image," wrote Gupta and colleagues, "progress on video generation has lagged." He believes that a unified image and video framework will close the gap between image and video generation.
How does WALT compare to Runway and Pika Labs?
The quality of motion in WALT seems to be a step up from other recent video models, particularly around 3D movement such as a burger turning on a table or horses walking. However, the overall polish of the output still trails what Runway or Pika Labs can produce.
That is partly because this is a research model, and the team is building it to scale. First, the base model produces small 128 x 128 pixel clips, which are then upsampled twice to reach 512 x 896 resolution at eight frames per second.
In contrast, Runway's Gen-2 can create video clips up to 1536 x 896, although that requires a paid subscription. The default, free version generates video up to 768 x 448, a lower resolution than is possible with WALT.
Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover.
When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?