OpenAI is teaching AI models to 'confess' when they hallucinate — here’s what that actually means

OpenAI wants its next generation of AI models to be a lot more upfront about their mistakes. With ChatGPT wrong about 25% of the time by some estimates, such a feature seems long overdue. But the company isn't training models to be more self-aware; it's training them to report their errors directly.

This week, OpenAI published new research on a technique it calls “confessions”: a method that adds a second output channel to a model, where it’s specifically trained to describe whether it followed the rules, where it may have fallen short or hallucinated, and what uncertainties it faced during the task.

Here's the thing, though: it’s not a ChatGPT feature available to users yet. Instead, it's a proof-of-concept safety tool designed to help researchers detect subtle failures that are otherwise hard to see. And according to early results highlighted in the study, it may actually work.

What “confessions” really are

Confessions are not the AI equivalent of a guilty conscience. They’re a trained behavior, created by giving the model a second task. The model starts by producing an answer, as usual. But then it produces a "ConfessionReport" evaluating the following (a rough sketch of what such a report might look like appears after this list):

  • Whether it accurately followed each instruction
  • Any shortcuts it took, or whether it “reward-hacked” the task
  • Hallucinated details or unjustified assumptions it made
  • Any ambiguity or uncertainty it faced about how to comply
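
OpenAI hasn’t published an exact schema for these reports, so the structure below is purely illustrative. As a mental model, though, you can picture the second channel as structured output riding alongside the answer; every field name here is an assumption, not OpenAI’s actual format:

```python
from dataclasses import dataclass, field

# Purely hypothetical shape for a "confession" emitted alongside an answer.
# The field names are illustrative guesses, not a real OpenAI schema or API.
@dataclass
class ConfessionReport:
    followed_instructions: bool                                  # complied with every instruction?
    shortcuts_taken: list[str] = field(default_factory=list)    # e.g. reward hacks
    possible_hallucinations: list[str] = field(default_factory=list)
    uncertainties: list[str] = field(default_factory=list)      # ambiguities it ran into

# Example: the model answered the question but invented a citation on the way.
report = ConfessionReport(
    followed_instructions=False,
    possible_hallucinations=["Cited a 2021 study I cannot verify exists"],
    uncertainties=["The prompt never said which date format to use"],
)
print(report)
```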

Crucially, the confession is judged only on whether it honestly describes what happened, not on whether it makes the model “look good.”

That means a model is rewarded for admitting a mistake, and not punished for exposing flaws in its own output. This reward structure is what makes the approach novel: it separates performance from honesty.
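
In training terms, think of it as two independent grading signals: one for the answer’s quality and one for the confession’s honesty. Here’s a minimal sketch of that separation, assuming simple 0-to-1 scores from separate graders; it is not OpenAI’s actual training code:

```python
# Minimal sketch of the two-channel reward idea, assuming scalar 0-1 scores
# from separate graders. Illustrative only; not OpenAI's training setup.

def combined_reward(answer_score: float, confession_honesty: float) -> float:
    # answer_score:       how well the answer performs the task
    # confession_honesty: how accurately the confession describes what really
    #                     happened, regardless of whether the answer was good
    #
    # Key property: a truthful admission of failure still earns full
    # confession reward, so honesty is never traded against performance.
    return answer_score + confession_honesty

# A model that fails the task but confesses honestly...
print(combined_reward(answer_score=0.0, confession_honesty=1.0))  # 1.0
# ...outscores one that fails the task AND hides it.
print(combined_reward(answer_score=0.0, confession_honesty=0.0))  # 0.0
```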

Anyone who has used ChatGPT or any other chatbot knows that one of the biggest problems with AI is that the model’s output can look perfectly fine while hiding a failure underneath. For example, the model may:

  • Invent a fact
  • Break a rule
  • Overlook a key constraint
  • Optimize for an unintended pattern
  • Rely on a faulty shortcut

These failures often go undetected because the answer itself doesn’t reveal them. And most users don't notice, because the model sounds so confident in its answer.

OpenAI built a set of “stress tests” specifically designed to provoke these kinds of hidden errors, including hallucination traps, ambiguous instructions, and tasks where the model’s incentive is misaligned with correctness.

As stated on OpenAI's site, when confessions were added, the model surfaced far more cases where it had deviated from the instructions. According to the paper, the new method reduced undetected misbehavior to about 4.4% on average within those controlled test environments.
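
To make that number concrete, here’s one way a test harness could compute an “undetected misbehavior” rate, assuming each case records whether the model actually misbehaved and whether its confession owned up to it. This is a sketch of the metric’s logic, not OpenAI’s evaluation code:

```python
# Hypothetical scoring of stress-test cases as (misbehaved, confessed) pairs.
cases = [
    (True, True),    # hallucinated, and the confession flagged it -> detected
    (True, False),   # hallucinated, stayed silent -> undetected misbehavior
    (False, False),  # behaved correctly, nothing to confess -> fine
]

undetected = sum(1 for misbehaved, confessed in cases if misbehaved and not confessed)
print(f"Undetected misbehavior: {undetected / len(cases):.1%}")  # -> 33.3%
```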

But what confessions still can't do is make AI models more truthful or reliable by default. In other words, they don’t eliminate hallucinations, reduce bias or prevent rule-breaking. Instead, they create a structured way for researchers to detect when those issues occur.

Bottom line

OpenAI’s “confessions” method doesn't mean the response to your next prompt will be any more accurate. It’s a research technique designed to make models better at reporting when they don’t follow instructions, not better at following them. And for now, it remains internal research.

The early results are promising, but they apply to controlled tests, not real-world conversations. Still, confessions could become an important part of how AI systems are evaluated as they get more capable, hopefully offering a new way to expose mistakes that ordinary outputs don’t reveal.

If this work continues to pay off, the next generation of AI assistants might tell you when they got something wrong. But don't hold your breath waiting for these models to be honest or accurate in the first place.

Amanda Caswell
AI Editor

Amanda Caswell is an award-winning journalist, bestselling YA author, and one of today’s leading voices in AI and technology. A celebrated contributor to various news outlets, she has earned a loyal readership with her sharp insights and relatable storytelling. Her work has been recognized with prestigious honors, including an award for outstanding contribution to media.

Known for her ability to bring clarity to even the most complex topics, Amanda seamlessly blends innovation and creativity, inspiring readers to embrace the power of AI and emerging technologies. As a certified prompt engineer, she continues to push the boundaries of how humans and AI can work together.

Beyond her journalism career, Amanda is a long-distance runner and mom of three. She lives in New Jersey.
