AI safety tests are heavily flawed, new study finds — here’s why that could be a huge problem


A new study into the testing procedure behind common AI models has reached some worrying conclusions.

The joint investigation between U.S. and U.K. researchers examined data from over 440 benchmarking tests used to measure an AI model's ability to solve problems and to set safety parameters. They reported flaws in these tests that undermine the credibility of the claims made about the models they evaluate.

According to the study, the flaws stem from benchmarks being built on unclear definitions or weak analytical methods, making it difficult to accurately assess a model's abilities or measure progress in AI.

What does this mean for AI?


The safety of AI models has been up for debate for a while now. In the past, companies like OpenAI and Google have launched models without completing safety reports.

Elsewhere, models have been launched after scoring highly in a range of benchmarking tests, only to fail when released to the public.

Google recently withdrew one of its latest models, Gemma, after it made false allegations about a U.S. senator, and similar issues have occurred in the past, such as xAI's Grok hallucinating conspiracy theories.

What’s the solution?

The study was carried out by researchers from the University of California, Berkeley and the University of Oxford in the U.K. The team made eight recommendations to AI companies to solve the issues they raised, grouped under three broad themes:

  • Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors.
  • Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.
  • Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons; conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose (a brief sketch of what uncertainty reporting can look like follows this list).
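To make that last point concrete, here is a minimal sketch of what reporting uncertainty on a benchmark score could look like. Everything in it is invented for illustration: the two models, the per-question results, and the choice of a bootstrap confidence interval, which is one common way to quantify uncertainty rather than the specific method the study's authors prescribe.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-question results (1 = correct, 0 = wrong) for two
# models on the same 100-item benchmark.
model_a = [1] * 78 + [0] * 22   # headline score: 78%
model_b = [1] * 74 + [0] * 26   # headline score: 74%

for name, scores in (("Model A", model_a), ("Model B", model_b)):
    lo, hi = bootstrap_ci(scores)
    mean = sum(scores) / len(scores)
    print(f"{name}: {mean:.2f} (95% CI {lo:.2f}-{hi:.2f})")

# If the two intervals overlap heavily, the 4-point gap between the
# headline scores may not reflect a real difference in ability,
# which is exactly the nuance a bare leaderboard number hides.
```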

They also provided a checklist that benchmark developers can use to test whether their own evaluations are up to scratch.

Whether or not the AI companies take these recommendations on board remains to be seen.




Alex Hughes
AI Editor

Alex is the AI editor at Tom's Guide. Dialed into all things artificial intelligence in the world right now, he knows the best chatbots, the weirdest AI image generators, and the ins and outs of one of tech's biggest topics.

Before joining the Tom’s Guide team, Alex worked for the brands TechRadar and BBC Science Focus.

He was highly commended in the Specialist Writer category at the 2023 BSME Awards and was part of a team that won best podcast at the 2025 BSME Awards.

In his time as a journalist, he has covered the latest in AI and robotics, broadband deals, the potential for alien life, the science of being slapped, and just about everything in between.

When he’s not trying to wrap his head around the latest AI whitepaper, Alex pretends to be a capable runner, cook, and climber.
