Google unveiled its impressive-sounding Gemini artificial intelligence models last week, including the flagship Gemini Ultra, with a video that appeared to show it responding in real time to changes in a video. The problem is, Google faked it.
In reality, Google did have Gemini Ultra solve the problems shown in the promotional clip, but it worked from still images and took far longer than the video suggests.
To see whether such things are even possible, like having an AI play a find-the-ball game, identify locations on a map or spot changes in an image as you draw it, Greg Technology created a simple app to test how well GPT-4V handles the same tasks.
So what exactly happened with Gemini?
Gemini Ultra was trained as a multimodal model from the ground up, meaning its dataset included images, text, code, video, audio and even motion data. This gives it a broader understanding of the world and lets it see it "as humans do".
To demonstrate these capabilities, Google released a video in which different actions were performed on camera while a voice, representing Gemini, described what it could see.
In the video, it seems like this is all happening live, with Gemini responding to changes as they happen, but this isn't exactly the case. While the responses are real, they were generated from still images or short segments rather than in real time. Put simply, the video was more a marketing exercise than a technical demo.
So OpenAI's GPT-4 can already do this?
In a short two-minute video, Greg, who makes demos of new technology for his channel, explained that he was excited by the Gemini demo but disappointed to find it wasn't real time.
"When I saw that I thought, that is kind of strange, as GPT-4 Vision, which came out a month ago, has been doing what is in the demo, only it is real," he said.
The conversation with GPT-4 is similar to the Voice version of ChatGPT, with responses in the same natural tone. The difference is that this demo included video, with the OpenAI model responding to hand gestures, identifying a drawing of a duck on water and playing rock, paper, scissors.
The code behind the ChatGPT video interface used in the demo has been released on GitHub by Greg Technology, so others can try it out for themselves.
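Greg's repo handles the webcam capture and voice side of things, but the underlying trick is simply sending individual frames to GPT-4 Vision. Here is a minimal sketch of how a single JPEG frame might be posted to OpenAI's chat completions endpoint using only Python's standard library; the helper names are illustrative and not taken from Greg's code, and you would need your own `OPENAI_API_KEY` set in the environment:

```python
import base64
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_vision_payload(jpeg_bytes: bytes, question: str) -> dict:
    """Build a GPT-4 Vision request: the frame goes in as a base64 data URL."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": "gpt-4-vision-preview",
        "max_tokens": 100,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

def ask_about_frame(jpeg_bytes: bytes, question: str) -> str:
    """POST one webcam frame to the API and return the model's text reply."""
    payload = build_vision_payload(jpeg_bytes, question)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask_about_frame(frame, "What am I holding?")` in a loop as frames come off the webcam gives the near-real-time effect seen in the demo, with speed limited mainly by API latency.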
Trying out the GPT-4 Vision code
I installed Greg Technology's code on my Apple MacBook Air M2 and paired it with my OpenAI API key to see if the video was genuine and not another "fake demo".
After a few minutes I had it installed and running, and it worked perfectly, happily identifying hand gestures, my glass coffee cup and a book. It could even tell me the book's title and author.
What this shows is just how far ahead of the pack OpenAI is, especially in terms of multimodal support. While other models can now analyze the contents of an image, they’d struggle with real-time video analysis.
Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover.
When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?