AI safety tests are heavily flawed, new study finds — here’s why that could be a huge problem
Are AI models being tested correctly?
A new study of the testing procedures behind common AI models has reached some worrying conclusions.
The joint investigation by U.S. and U.K. researchers examined data from more than 440 benchmarking tests used to measure an AI model's ability to solve problems and to determine safety parameters. The researchers found flaws in these tests that undermine the credibility of claims made about the models.
According to the study, the flaws stem from benchmarks being built on unclear definitions or weak analytical methods, making it difficult to accurately assess a model's abilities or the pace of AI progress.
“Benchmarks underpin nearly all claims about advances in AI,” said Andrew Bean, lead author of the study. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.”
Currently, there is no clear regulation of AI models. Instead, they are evaluated against a wide range of benchmarks, from their ability to solve common logic problems to tests of whether they can be blackmailed.
These tests show AI companies where their models fall short, so improvements can be made in the next iteration. Benchmark results are also typically the measure used in policy and regulation decisions.
What does this mean for AI?
The safety of AI models has been a subject of debate for some time. In the past, companies such as OpenAI and Google have launched models without completing safety reports.
Elsewhere, models have been launched after scoring highly in a range of benchmarking tests, only to fail when released to the public.
Google recently withdrew one of its latest models, Gemma, after it made false allegations about a U.S. senator, and similar issues have occurred in the past, such as xAI’s Grok hallucinating conspiracy theories.
What’s the solution?
The study was carried out by researchers from the University of California, Berkeley, and the University of Oxford in the U.K. The team made eight recommendations to AI companies to address the issues raised, summarized here in three groups:
- Define and isolate: Provide a precise, operational definition for the concept being measured and control for unrelated factors.
- Build representative evaluations: Ensure test items represent real-world conditions and cover the full scope of the target skill or behaviour.
- Strengthen analysis and justification: Use statistical methods to report uncertainty and enable robust comparisons (a brief sketch of what this can look like in practice follows this list); conduct detailed error analysis to understand why a model fails; and justify why the benchmark is a valid measure for its intended purpose.
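To make the "report uncertainty" recommendation concrete, here is a minimal, hypothetical sketch in Python; the data and function names are illustrative, not taken from the study. It uses bootstrap resampling to put a confidence interval around a benchmark accuracy score, so two models can be compared on more than a single headline number:

```python
import random

# Hypothetical benchmark results: 1 = correct answer, 0 = incorrect.
# Illustrative only; a real benchmark would have hundreds of items.
results = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Estimate a confidence interval for mean accuracy via bootstrap resampling."""
    means = []
    for _ in range(n_resamples):
        sample = random.choices(scores, k=len(scores))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    return (means[int((alpha / 2) * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples)])

accuracy = sum(results) / len(results)
low, high = bootstrap_ci(results)
print(f"Accuracy: {accuracy:.2f} (95% CI: {low:.2f}-{high:.2f})")
# A wide interval is a warning that a headline score alone can't support
# claims that one model genuinely beats another.
```

If two models' intervals overlap substantially, the honest conclusion is "no measurable difference," which is exactly the kind of caveat the researchers say benchmark reports tend to omit.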
They also provided a checklist that benchmark developers can use to gauge whether their own tests are up to scratch.
Whether AI companies take these recommendations on board remains to be seen.