AI models are getting better at grade school math — but a new study suggests they may be cheating
LLMs may be memorizing grade school math benchmarks rather than learning how to actually solve the problems. But the results are mixed.
Large language models (LLMs) that power chatbots like ChatGPT may be getting better at answering benchmark questions that measure mathematical reasoning. But this may actually be a bad thing.
A pre-print research paper released on Wednesday by researchers at Scale AI details how LLMs have been achieving impressive results on math benchmark tests, but notes growing concern that dataset contamination is fueling those high scores.
Dataset contamination happens when data resembling benchmark questions leaks into a model's training data. The LLM may then end up learning to pass these standardized tests rather than genuinely understanding the math problems it's asked to solve.
It's a bit like preparing for a math quiz by memorizing the answers rather than learning how to solve the problems. In machine learning, this issue is called overfitting.
However, the authors of the paper say their results don't entirely support that theory: an overfit model isn't necessarily bad at reasoning, it just might not be as good as the benchmarks suggest.
Developing a new math benchmark
Data contamination is a huge problem for LLM evals right now. At Scale, we created a new test set for GSM8k *from scratch* to measure overfitting and found evidence that some models (most notably Mistral and Phi) do substantially worse on this new test set compared to GSM8k. pic.twitter.com/JgPQUaYsEc (May 2, 2024)
In the paper the authors wrote: “The fact that a model is overfit does not mean that it is poor at reasoning, merely that it is not as good as the benchmarks might indicate it to be." They found that many of the most overfit models can still reason and solve problems they’ve never encountered before in their training sets.
To run these evaluations, they developed their own math benchmark (GSM1k), which they say tests an AI's ability to understand the problem rather than just recall the answer.
The questions are at grade school math level. A typical GSM1k question looks like this: Jim wants to spend 15% of his monthly earnings on groceries. He makes $2,500 a month. How much money will he have left over? The correct answer is $2,125.
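For readers who want to check the math, here's a minimal Python sketch of that calculation (an illustrative worked example, not code from the Scale AI paper):

```python
# Illustrative arithmetic for the sample GSM1k-style question above.
monthly_earnings = 2500      # Jim earns $2,500 a month
grocery_share = 0.15         # he plans to spend 15% on groceries

groceries = monthly_earnings * grocery_share   # 0.15 * 2500 = $375
left_over = monthly_earnings - groceries       # 2500 - 375 = $2,125

print(f"Spent on groceries: ${groceries:,.0f}")
print(f"Left over: ${left_over:,.0f}")
```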
While such questions closely resemble those in the industry's gold-standard test (GSM8k) in difficulty, they're different enough to test whether LLMs can handle math problems they haven't seen before.
Using their new test, the research team at Scale AI reported accuracy drops of up to 13% when evaluating leading open- and closed-source LLMs, with the Mistral and Phi model families showing the most notable gaps. Frontier models such as Gemini, GPT, and Claude showed minimal signs of overfitting.
What's next?
Academic benchmarks are losing their potency. Moving forward, there're 3 types of LLM evaluations that matter: 1. Privately held test set but publicly reported scores, by a trusted 3rd party who doesn’t have their own LLM to promote. @scale_AI’s latest GSM1k is a great example.… pic.twitter.com/j6a1Mf5biN (May 2, 2024)
This ‘issue’ may end up resolving itself over time: the authors predict that by 2025, grade school math will likely no longer be difficult enough to benchmark new LLMs. Still, they say improving reasoning in LLMs “is one of the most important directions of current research.”
Jim Fan, a senior research scientist at NVIDIA, said on X that academic benchmarks are losing their potency.
He said the three types of LLM evaluations that will matter going forward are privately held test sets like Scale AI's, public comparative benchmarks like Chatbot Arena, where models can be tested side by side, and privately curated benchmarks for each company's own use cases.
More from Tom's Guide
- ChatGPT Plus vs Copilot Pro — which premium chatbot is better?
- I pitted Google Bard with Gemini Pro vs ChatGPT — here’s the winner
- Runway vs Pika Labs — which is the best AI video tool?

Christoph Schwaiger is a journalist, mainly covering AI, health, and current affairs. His stories have been published by Tom's Guide, Live Science, New Scientist, and the Global Investigative Journalism Network, among other outlets. Christoph has appeared on LBC and Times Radio. Additionally, he previously served as a National President for Junior Chamber International (JCI), a global leadership organization, and graduated cum laude from the University of Groningen in the Netherlands with an MA in journalism. You can follow him on X (Twitter) @cschwaigermt.