OpenAI is teaching AI models to 'confess' when they hallucinate — here’s what that actually means
It's supposed to improve failure detection, but it's not a magic fix
OpenAI wants its next generation of AI models to be a lot more upfront about their mistakes. With some estimates putting ChatGPT's error rate at roughly 25%, a feature like this seems long overdue. But the company isn't training models to be more self-aware; it's training them to report their errors directly.
This week, OpenAI published new research on a technique it's calling “confessions” — a method that adds a second output channel to a model, where it’s specifically trained to describe whether it followed the rules, where it may have fallen short or hallucinated, and what uncertainties it faced during the task.
Here's the thing, though: it’s not a ChatGPT feature that's available to users yet. Instead, it's a proof-of-concept safety tool designed to help researchers detect subtle failures that are otherwise hard to see. And according to early results highlighted in the study, it may actually work.
What “confessions” really are
Confessions are not the AI equivalent of a guilty conscience. They’re a trained behavior, created by giving the model a second task. The model starts by producing an answer, as usual. But then it produces a "ConfessionReport" evaluating the following (a rough sketch follows the list):
- Whether it followed each instruction
- Any shortcuts it took, or whether it “reward-hacked” the task
- Any hallucinated details or unjustified assumptions it made
- Any ambiguity or uncertainty it faced about how to comply
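To make that concrete, here's a minimal sketch of what such a report might contain. The structure and field names below are our illustration, not OpenAI's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of a confession report. The fields mirror the list
# above; OpenAI's paper defines its own format, which may differ.
@dataclass
class ConfessionReport:
    instructions_followed: dict[str, bool]   # per-instruction compliance
    shortcuts_or_reward_hacks: list[str]     # shortcuts the model admits to
    hallucinations: list[str]                # invented facts or unjustified assumptions
    uncertainties: list[str]                 # ambiguity it ran into during the task

# Example: a model confessing that it strayed from its sources.
report = ConfessionReport(
    instructions_followed={"cite only the provided sources": False},
    shortcuts_or_reward_hacks=["answered from memory instead of the sources"],
    hallucinations=["the publication year of the cited paper"],
    uncertainties=["unclear whether paraphrases count as citations"],
)
print(report)
```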
Crucially, the confession is judged only on whether it honestly describes what happened, not on whether it makes the model “look good.”
That means a model is rewarded for admitting a mistake and not punished for exposing flaws in its own output. This reward structure is what makes the approach novel: it separates performance from honesty.
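A toy version of that reward split (our own simplification, not OpenAI's actual training code) might look like this:

```python
# Toy sketch of the reward split described above. The key property: the
# confession is scored only on honesty, so admitting a failure never
# costs the model reward on the confession channel.

def answer_reward(answer_is_correct: bool) -> float:
    """Reward for the main task, exactly as usual."""
    return 1.0 if answer_is_correct else 0.0

def confession_reward(model_misbehaved: bool, confession_admits_it: bool) -> float:
    """Reward the confession only for matching what actually happened."""
    honest = confession_admits_it == model_misbehaved
    return 1.0 if honest else 0.0

# A model that breaks a rule but owns up still earns full confession reward...
print(confession_reward(model_misbehaved=True, confession_admits_it=True))   # 1.0
# ...while hiding the failure costs it.
print(confession_reward(model_misbehaved=True, confession_admits_it=False))  # 0.0
```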
Anyone who has used ChatGPT or any other chatbot knows that one of the biggest problems with AI is that the model’s output can look perfectly fine while hiding a failure underneath. For example, the model may:
- Invent a fact
- Break a rule
- Overlook a key constraint
- Optimize for an unintended pattern
- Or rely on a faulty shortcut
These failures often go undetected because the answer itself doesn’t reveal them. And most users don't notice because the model seems so confident in its answer.
OpenAI built a set of “stress tests” specifically designed to provoke these kinds of hidden errors, including hallucination traps, ambiguous instructions, and tasks where the model’s incentive is misaligned with correctness.
As stated on OpenAI's site, when confessions were added, the model surfaced far more cases where it had deviated from the instructions. According to the paper, the new method reduced undetected misbehavior to about 4.4% on average within those controlled test environments.
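As a rough illustration of how a number like that could be tallied (our framing, not necessarily the paper's exact methodology):

```python
# Rough sketch: "undetected misbehavior" as the share of genuinely bad
# outputs that the confession failed to flag.

cases = [
    # (model_misbehaved, confession_flagged_it)
    (True, True),    # caught
    (True, False),   # misbehaved, confession missed it -> undetected
    (False, False),  # behaved, nothing to flag
    (True, True),    # caught
]

misbehaved = [c for c in cases if c[0]]
undetected = [c for c in misbehaved if not c[1]]
print(f"Undetected misbehavior: {len(undetected) / len(misbehaved):.1%}")  # 33.3%
```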
What confessions still can't do, though, is make AI models more truthful or reliable by default. In other words, they don’t eliminate hallucinations, reduce bias or prevent rule-breaking. Instead, they create a structured way for researchers to detect when those issues occur.
Bottom line
OpenAI’s “confessions” method doesn't mean your next prompt response will be any more accurate. It’s a research technique designed to make models better at reporting when they don’t follow instructions — not better at following them. And for now, it remains part of OpenAI's internal research.
The early results are promising, but they apply to controlled tests, not real-world conversations. Still, confessions could become an important part of how AI systems are evaluated as they get more capable, hopefully offering a new way to expose mistakes that ordinary outputs don’t reveal.
If this work continues to pay off, the next generation of AI assistants might tell you when they got something wrong. But don't hold your breath waiting for these models to be honest or accurate in the first place.