Study finds GPT-5 is wrong about 1 in 4 times — here's why
AI chatbots don't just make things up; they've been trained and rewarded to do it
The other day I was brainstorming with ChatGPT when, all of a sudden, it launched into a long fantasy story that had nothing to do with my queries. It was so ridiculous that it made me laugh. Lately, I haven't seen mistakes like this as often with text prompts, but I still see them pretty regularly with image generation.
These random moments when a chatbot strays from the task are known as hallucinations. What's odd is how confident the chatbot is about the wrong answer it's giving, which is one of the biggest weaknesses of today's AI assistants. However, a new study from OpenAI argues these failures aren't random, but a direct result of how models are trained and evaluated.
Why chatbots keep guessing when they shouldn’t
The research points to a structural issue behind hallucinations: the benchmarks and leaderboards used to rank AI models reward confident answers.
In other words, when a chatbot says “I don’t know,” it gets penalized in testing. That means the models are effectively encouraged to always provide an answer, even if they’re not sure it’s right.
In practice, that makes your AI assistant more likely to guess than admit uncertainty. For everyday queries, this can be harmless. But in higher-stakes cases, from medical questions to financial advice, those confident errors can quickly turn dangerous.
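To see why guessing wins under that kind of scoring, here's a minimal, purely illustrative Python sketch. It is not OpenAI's evaluation code, and the 30% confidence figure is a made-up assumption, not a number from the paper.

```python
# Toy accuracy-style scorer: "I don't know" earns nothing, a lucky guess earns full credit.

def expected_score(p_correct: float, abstain: bool) -> float:
    """Expected benchmark score for one question under accuracy-only grading."""
    if abstain:
        return 0.0          # admitting uncertainty is scored the same as being wrong
    return p_correct * 1.0  # a guess earns 1 point whenever it happens to be right

p = 0.30  # assume the model thinks it has only a 30% chance of being right
print(expected_score(p, abstain=True))   # 0.0 -> abstaining is never rewarded
print(expected_score(p, abstain=False))  # 0.3 -> guessing always scores higher on average
```

Under a rule like this, a model that always guesses will beat one that honestly says "I don't know," which is exactly the incentive the researchers describe.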
As a power user, that's why I always fact-check and ask the chatbot to cite its sources. Sometimes, when the information seems too far-fetched and I ask for a source, the chatbot will reply with something like "Good catch!" without ever admitting it was wrong.
Newer models aren’t immune
Interestingly, OpenAI’s paper found that reasoning-focused models like o3 and o4-mini actually hallucinate more often than some older models. Why? Because they produce more claims overall, which means more chances to be wrong.
So even if a model is "smarter" at reasoning, that doesn't necessarily make it more honest about what it doesn't know.
What can fix this problem?
Researchers argue that the solution is to change how we score and benchmark AI. Instead of punishing models for saying "I'm not sure," the most valuable tests should reward calibrated responses, uncertainty flags or the ability to defer to other sources.
That could mean your future chatbot hedges more often: less "here's the answer" and more "here's what I think, but I'm not certain." It may feel slower, but it could dramatically reduce harmful errors, and it's a reminder that critical thinking on our part is still important.
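One way to picture that fix is a scoring rule where a confident wrong answer actually costs points, so abstaining becomes the better move when the model is unsure. Again, this is a hypothetical sketch; the -2 penalty and 30% confidence are values chosen only for illustration, not taken from the study.

```python
# Toy "calibrated" scorer: wrong answers are penalized, abstaining is neutral.

def expected_score(p_correct: float, abstain: bool, wrong_penalty: float = -2.0) -> float:
    """Expected score when incorrect answers lose points instead of simply scoring zero."""
    if abstain:
        return 0.0  # saying "I'm not sure" is neutral rather than punished
    return p_correct * 1.0 + (1 - p_correct) * wrong_penalty

p = 0.30  # assumed low confidence
print(expected_score(p, abstain=False))  # -1.1 -> a low-confidence guess now loses points
print(expected_score(p, abstain=True))   #  0.0 -> admitting uncertainty becomes the smarter answer
```

Flip the incentive this way and the model that hedges when unsure outscores the one that bluffs, which is the behavior the researchers want benchmarks to encourage.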
Why it matters for you
If you're using popular chatbots such as ChatGPT, Gemini, Claude or Grok, you've almost certainly seen a hallucination. This research suggests it's not entirely the model's fault but a consequence of how models are tested, as if the benchmarks were a game of who can answer most often rather than most reliably.
For users, that means we need to be diligent and consider AI answers as a first suggestion and not the final word. And for developers, this is a sign that it's time to rethink how we measure success so that future AI assistants can admit what they don’t know instead of getting things completely wrong.
Amanda Caswell is an award-winning journalist, bestselling YA author, and one of today’s leading voices in AI and technology. A celebrated contributor to various news outlets, her sharp insights and relatable storytelling have earned her a loyal readership. Amanda’s work has been recognized with prestigious honors, including outstanding contribution to media.
Known for her ability to bring clarity to even the most complex topics, Amanda seamlessly blends innovation and creativity, inspiring readers to embrace the power of AI and emerging technologies. As a certified prompt engineer, she continues to push the boundaries of how humans and AI can work together.
Beyond her journalism career, Amanda is a long-distance runner and mom of three. She lives in New Jersey.