A new artificial intelligence model called WALT can take a simple image or text input and convert it into a photorealistic video. Preview clips include dragons breathing fire, asteroids hitting the Earth and horses walking on a beach.
One of the more notable advances made by the Stanford University team behind WALT is the ability to generate consistent 3D motion for an object, and to do so from a natural language prompt.
Creating video from images or text is the next big frontier. It is a complex problem to solve, requiring more than just stitching a sequence of images together: each frame has to follow logically from the previous one to create fluid motion.
What makes WALT stand out?
Companies like Pika Labs, Runway, Meta and StabilityAI all have generative video models with varying degrees of fluidity, coherence and quality. Agrim Gupta, the researcher behind WALT, says it can generate video from text or images, including consistent 3D motion.
Gupta says WALT was trained with both photographs and video clips stored inside the same latent space. This allowed for training across both at the same time, giving the model a deeper understanding of motion from the start.
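Conceptually, training images and videos "inside the same latent space" means a still image can be treated as a one-frame video, so a single encoder compresses both into the same kind of latent grid. The toy sketch below illustrates only that idea with simple patch averaging; it is not WALT's actual encoder, and the `encode` function, patch size and shapes are illustrative assumptions.

```python
import numpy as np

def encode(frames):
    # Toy stand-in for a video encoder: average 8x8 spatial patches.
    # frames has shape (T, H, W, C); a still image is simply T == 1.
    T, H, W, C = frames.shape
    return frames.reshape(T, H // 8, 8, W // 8, 8, C).mean(axis=(2, 4))

image = np.random.rand(1, 128, 128, 3)   # a still image: one frame
video = np.random.rand(16, 128, 128, 3)  # a 16-frame video clip

# Both inputs land in the same latent grid shape per frame,
# so one model can be trained on images and videos at once.
print(encode(image).shape)  # (1, 16, 16, 3)
print(encode(video).shape)  # (16, 16, 16, 3)
```

Because images and clips share one representation, the abundant supervision available for still images can inform the model's understanding of individual video frames, while the clips teach it motion.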
"We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space," the team announced on X on December 11, 2023.
WALT is designed to be scalable and efficient, delivering state-of-the-art results in both image and video generation. It uses a cascade of three models, a base generator plus two upsamplers, which allows for higher resolution and consistent motion.
"While generative modeling has seen tremendous recent advances for image," wrote Gupta and colleagues, "progress on video generation has lagged." He believes that a unified image and video framework will close the gap between image and video generation.
How does WALT compare to Runway and Pika Labs?
The quality of motion in WALT seems to be a step up from other recent video models, particularly around 3D movement such as a burger turning on a table or horses walking. However, the overall polish of the output still trails what Runway or Pika Labs can produce.
That is partly because this is a research model, and the team is building it to scale. First, the base model produces small 128 x 128 pixel clips, which are then upsampled twice to reach 512 x 896 resolution at eight frames per second.
In contrast, Runway's Gen-2 can create video clips up to 1536 x 896, although that requires a paid subscription. The default, free version generates video up to 768 x 448, a lower resolution than is possible with WALT.
Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover.
When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?