AI safety tests are heavily flawed, new study finds — here’s why that could be a huge problem
Are AI models being tested correctly?
A new study of the testing procedures behind common AI models has reached some worrying conclusions.
The joint investigation by U.S. and U.K. researchers examined data from more than 440 benchmarking tests used to measure an AI model's ability to solve problems and to determine safety parameters. The researchers found flaws in these tests that undermine the credibility of claims made about the models.
According to the study, the flaws stem from benchmarks being built on unclear definitions or weak analytical methods, making it difficult to accurately assess a model's abilities or the pace of AI progress.
“Benchmarks underpin nearly all claims about advances in AI,” said Andrew Bean, lead author of the study. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.”
Currently, there is no clear regulation of AI models. Instead, they are evaluated against a wide range of benchmarks, from their ability to solve common logic problems to tests of whether they can be blackmailed.
These tests show AI companies where their models fall short, so improvements can be made in the next iteration. Benchmark results are also typically the measure used in policy and regulation decisions.
What does this mean for AI?
The safety of AI models has been a subject of debate for some time. In the past, companies such as OpenAI and Google have launched models without completing safety reports.
Elsewhere, models have been launched after scoring highly in a range of benchmarking tests, only to fail when released to the public.
Google recently withdrew one of its latest models, Gemma, after it made false allegations about a U.S. senator, and similar issues have occurred in the past, such as xAI’s Grok hallucinating conspiracy theories.
What’s the solution?
The study was carried out by researchers from the University of California, Berkeley, and the University of Oxford in the U.K. The team made eight recommendations to AI companies to address the issues raised, summarized here in three groups:
- Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors.
- Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.
- Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons (a brief sketch of what this can look like in practice follows this list); conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose.
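To make the "report uncertainty" recommendation concrete, here is a minimal, hypothetical sketch in Python; the data and function names are illustrative, not taken from the study. It uses bootstrap resampling to put a confidence interval around a benchmark accuracy score, so two models can be compared on more than a single headline number:

```python
import random

# Hypothetical benchmark results: 1 = correct answer, 0 = incorrect.
# Illustrative only; a real benchmark would have hundreds of items.
results = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Estimate a confidence interval for mean accuracy via bootstrap resampling."""
    means = []
    for _ in range(n_resamples):
        sample = random.choices(scores, k=len(scores))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    return (means[int((alpha / 2) * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples)])

accuracy = sum(results) / len(results)
low, high = bootstrap_ci(results)
print(f"Accuracy: {accuracy:.2f} (95% CI: {low:.2f}-{high:.2f})")
# A wide interval is a warning that a headline score alone can't support
# claims that one model genuinely beats another.
```

If two models' intervals overlap substantially, the honest conclusion is "no measurable difference," which is exactly the kind of caveat the researchers say benchmark reports tend to omit.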
They also provided a checklist that benchmark developers can use to gauge whether their own tests are up to scratch.
Whether AI companies take these recommendations on board remains to be seen.