ChatGPT-5 can now see and hear better than ever — here's why that matters

A phone with the ChatGPT logo on a keyboard (Image credit: ChatGPT AI-generated image)
What is 'Multimodality'?

AI in a man's hand (Image credit: Shutterstock)

In the case of an AI, multimodality is the ability to understand and interact with more than just text: voice, images or video. A multimodal chatbot can work with multiple types of input and output.

This week's GPT-5 upgrade to ChatGPT dramatically improves the chatbot's speed, along with its performance in coding, math and response accuracy. But arguably the most useful improvement in the grand scheme of AI development will be its multimodal capabilities.

ChatGPT-5 brings an enhanced voice mode and a better ability to process visual information. While Sam Altman didn't go into details on multimodality specifically in this week's GPT-5 reveal livestream, he previously confirmed to Bill Gates on an episode of the latter's podcast that ChatGPT is moving towards "speech in, speech out. Images. Eventually video."

The improved voice mode courtesy of GPT-5 now works with custom GPTs and will adapt its tone and speech style based on user instruction. For example, you could ask it to slow down if it's going too fast, or to make the voice a bit warmer if you feel the tone is too harsh. OpenAI has also confirmed that the old Standard Voice Mode is being phased out across all its models over the next 30 days.

Of course, the majority of interaction with ChatGPT, or any of its best alternatives, will be through text. But as AI becomes an ever-larger part of our digital lives, it will need to shift toward predominantly multimodal input.

We've seen this before; social media only really got going when it moved off laptops and desktops and onto smartphones.

Suddenly, users could snap pictures and upload them with the same device. Whether that device is your phone or, as Zuckerberg would have you believe, a pair of the best smart glasses is beside the point. The most successful AI will be the one that can make sense of the world around it.

Why does this matter?

A demo of the improved Voice Mode during OpenAI's GPT-5 livestream (Image credit: OpenAI)

GPT-5 has been designed to natively handle (and generate) multiple types of data within a single model. Previous iterations used a plugin-style approach, so moving away from that should result in more seamless interactions, whichever type of input you choose.
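To see what "one model, many modalities" looks like in practice, here's a minimal sketch using the OpenAI Python SDK's chat-completions shape, where text and an image travel in the same request. The "gpt-5" model identifier and the image URL are assumptions for illustration, not confirmed API details.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",  # assumed identifier, for illustration only
    messages=[
        {
            "role": "user",
            # Text and image travel in the same message, so a single
            # model handles both rather than a plugin per modality.
            "content": [
                {"type": "text", "text": "What's happening in this photo?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)

The exact field names matter less than the shape: the request mixes modalities in one call, with no separate vision plugin doing the routing.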

There is a huge number of benefits to a more robust multimodal AI, including for users with hearing or sight impairments. The ability to tailor the chatbot's responses to a disability will do wonders for tech accessibility.

The increasing use of voice mode could be what drives adoption of ChatGPT Plus, since the premium tier offers unlimited voice responses while free users are limited to a set number of hours.

Meanwhile, improved image understanding means the AI should be less prone to hallucinations when analyzing a chart or picture you give it. That works in tandem with the tool's "Visual Workspace" feature, which lets it interact with charts and diagrams. In turn, this should help ChatGPT produce better, more accurate images when prompted.

Think about this in an educational context and it's going to be a huge help, especially since GPT-5 can now keep track of information across much longer stretches of conversation: users can refer back to images shared earlier in a chat and it will remember them.
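As a rough sketch of how that longer memory plays out, the conversation history below includes an image from an early turn that a later question points back to. Again, the model name, URL and wording are hypothetical; a chat-completions-style API "remembers" whatever history you resend, and a longer context window simply means earlier images stay within reach.

from openai import OpenAI

client = OpenAI()

# The full history is resent with each request; the image from the
# first turn stays available for the later question to reference.
history = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Here's my study diagram."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    },
    {"role": "assistant", "content": "Got it: a flowchart of photosynthesis."},
    # ...many turns later, the user points back at the earlier image...
    {
        "role": "user",
        "content": "In the diagram I sent earlier, what does the second arrow represent?",
    },
]

reply = client.chat.completions.create(model="gpt-5", messages=history)
print(reply.choices[0].message.content)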

While everyone knows that AI image generation has a dark side, there's no doubt that effective multimodality is the future of AI models, and it will be interesting to see how Google Gemini responds to these GPT-5 upgrades.

Jeff Parsons
UK Editor In Chief

Jeff is UK Editor-in-Chief for Tom’s Guide looking after the day-to-day output of the site’s British contingent.

A tech journalist for over a decade, he’s travelled the world testing any gadget he can get his hands on. Jeff has a keen interest in fitness and wearables as well as the latest tablets and laptops.

A lapsed gamer, he fondly remembers the days when technical problems were solved by taking out the cartridge and blowing out the dust.
