Claude AI training leak reveals trusted and banned websites — here’s what it means for you

A leaked internal document has exposed the data sources used to fine-tune Claude, Anthropic’s AI assistant, and it’s prompting new concerns about how today’s most powerful models are being shaped behind the scenes.

The document, reportedly created by third-party data-labeling firm Surge AI, included a list of websites that gig workers were instructed to use (and avoid) while helping Claude learn how to generate higher-quality responses.

The spreadsheet was stored in an open Google Drive folder and remained publicly accessible until Business Insider flagged it.

What the leak revealed

The spreadsheet included more than 120 “whitelisted” sites, such as:

  • Harvard.edu
  • Bloomberg
  • Mayo Clinic
  • The National Institutes of Health (NIH)

Those were the trusted sources that Surge AI workers could pull from when crafting prompts and answers during Claude’s reinforcement learning from human feedback (RLHF) phase.

But the document also listed 50+ “blacklisted” sites, places workers were explicitly told to avoid. That list included major publishers and platforms like:

  • The New York Times
  • Reddit
  • The Wall Street Journal
  • Stanford University
  • Wiley.com

Why were these sites off-limits? While we don't know for sure, it's most likely due to licensing or copyright concerns, a possibility made more notable by Reddit’s recent lawsuit against Anthropic over alleged data misuse.

Why it matters

Although the data was used for fine-tuning (not pre-training), the leak raises serious questions about data governance and legal risk in the AI industry.

Experts warn that courts may not draw a sharp line between pre-training and fine-tuning data when evaluating potential copyright violations.

Surge AI quickly took the document offline after the leak was reported.

Anthropic, meanwhile, told Business Insider it had no knowledge of the list, which was reportedly created independently by its vendor.

Data control in the AI era

This isn’t the first time an AI vendor has mishandled sensitive training materials. Scale AI, another major player in the data-labeling space, previously faced a similar leak.

But the stakes are higher now. With Anthropic valued at over $60 billion and Claude emerging as a top competitor to ChatGPT, every misstep invites scrutiny.

This event highlights a growing vulnerability in the AI ecosystem: as companies rely more on human-supervised training, they also depend on third-party firms, and those firms don’t always have airtight security or oversight.

What it means for you

AI users need to understand that the quality, accuracy and even the ethical grounding of their chatbot’s responses are deeply tied to the data it's trained on and to who decides what goes in or stays out.

This leak reveals that even top-tier models like Claude can be influenced by behind-the-scenes decisions made by third-party vendors.

When those choices involve inconsistent standards or unclear sourcing, they raise serious questions about bias, trust and accountability in the AI we rely on every day.

The takeaway

This leak is a glimpse into how major AI companies shape their models and who is guiding the process.

As AI becomes more embedded in everyday tools, trust will come down to transparency.

On that front, it appears there’s still a long way to go.

Amanda Caswell
AI Writer
