Study finds ChatGPT-5 is wrong about 1 in 4 times — here's why

Holding phone with ChatGPT logo
(Image credit: Shutterstock)

The other day I was brainstorming with ChatGPT when, all of a sudden, it launched into a long fantasy story that had nothing to do with my queries. It was so ridiculous that it made me laugh. Lately, I haven't seen mistakes like this as often with text prompts, but I still see them pretty regularly with image generation.

These random moments when a chatbot strays from the task are known as hallucinations. What's odd is how confident the chatbot sounds while giving a wrong answer, which is one of the biggest weaknesses of today's AI assistants. However, a new study from OpenAI argues these failures aren't random, but a direct result of how models are trained and evaluated.

Why chatbots keep guessing when they shouldn’t

ChatGPT logo on phone in front of robot thinking

(Image credit: Shutterstock)

Research points to a structural issue causing hallucinations: essentially, the problem stems from benchmarks and leaderboards that rank AI models and reward confident answers.

In other words, when a chatbot says “I don’t know,” it gets penalized in testing. That means the models are effectively encouraged to always provide an answer, even if they’re not sure it’s right.

In practice, that makes your AI assistant more likely to guess than admit uncertainty. For everyday queries, this can be harmless. But in higher-stakes cases, from medical questions to financial advice, those confident errors can quickly turn dangerous.
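The incentive problem above can be sketched with a toy calculation (my own illustrative numbers, not figures from the OpenAI paper). Under typical binary grading, a wrong guess and an "I don't know" both score zero, so a model is never worse off guessing:

```python
# Toy sketch of binary benchmark grading: a correct answer scores 1,
# and a wrong answer or an abstention both score 0. The expected score
# of guessing is just the model's chance of being right.

def expected_score(p_correct: float, guess: bool) -> float:
    """Expected score under right-or-nothing grading."""
    return p_correct if guess else 0.0

# Even at only 25% confidence, guessing beats saying "I don't know":
assert expected_score(0.25, guess=True) > expected_score(0.25, guess=False)
```

Since abstaining can never earn points under this scheme, a model trained to maximize benchmark scores learns to always answer, however unsure it is.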

That's why, as a power user, I always fact-check and ask the chatbot to cite its sources. Sometimes, if the information seems too far-fetched and I ask for a source, the chatbot will say something like "Good catch!" without ever admitting it was wrong.

Newer models aren’t immune

ChatGPT-5 Image on a keyboard

(Image credit: ChatGPT AI generated image)

Interestingly, OpenAI’s paper found that reasoning-focused models like o3 and o4-mini actually hallucinate more often than some older models. Why? Because they produce more claims overall, which means more chances to be wrong.

So, even if a model is "smarter" at reasoning, that doesn't make it more honest about what it doesn't know.

What can fix this problem?

Person coding on computer

(Image credit: Shutterstock)

Researchers argue that the solution is to change how we score and benchmark AI. Instead of punishing models for saying "I'm not sure," the most valuable tests should reward calibrated responses, uncertainty flags or the ability to defer to other sources.

That could mean your future chatbot hedges more often: less "here's the answer" and more "here's what I think, but I'm not certain." It may feel slower, but it could dramatically reduce harmful errors. Either way, critical thinking on our part is still important.
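One way to reward calibration is negative marking, where wrong answers cost points and abstaining costs nothing. Here's a toy sketch with hypothetical numbers (the penalty value is my own, not from the study), showing how such a scheme makes abstaining the rational choice below a confidence threshold:

```python
# Hypothetical negative-marking scheme: +1 for a correct answer,
# -penalty for a wrong one, 0 for abstaining. Guessing only pays off
# when confidence p satisfies p - (1 - p) * penalty > 0,
# i.e. p > penalty / (1 + penalty).

def expected_points(p: float, penalty: float, guess: bool) -> float:
    """Expected score when answering with confidence p, or abstaining."""
    return p - (1 - p) * penalty if guess else 0.0

def should_answer(p: float, penalty: float) -> bool:
    """Answer only if guessing beats the zero score of abstaining."""
    return expected_points(p, penalty, guess=True) > 0

# With a penalty of 3, answering only pays off above 75% confidence
# (3 / (1 + 3) = 0.75):
assert not should_answer(0.6, penalty=3)
assert should_answer(0.8, penalty=3)
```

Under scoring like this, a model that says "I'm not sure" when its confidence is low would outrank one that guesses confidently and is often wrong.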

Why it matters for you

Person typing on laptop keyboard

(Image credit: Unsplash)

If you're using popular chatbots including ChatGPT, Gemini, Claude or Grok, you've almost certainly seen a hallucination. This research suggests it's not entirely the model's fault, but a product of how models are tested, as if evaluation were a game of who can be right most often.

For users, that means we need to be diligent and treat AI answers as a first suggestion, not the final word. And for developers, it's a sign that it's time to rethink how we measure success, so future AI assistants can admit what they don't know instead of getting things completely wrong.


Amanda Caswell
AI Writer

