Claude AI training leak reveals trusted and banned websites — here’s what it means for you

A leaked internal document has exposed the data sources used to fine-tune Claude, Anthropic’s AI assistant, and it’s prompting new concerns about how today’s most powerful models are being shaped behind the scenes.

The document, reportedly created by third-party data-labeling firm Surge AI, included a list of websites that gig workers were instructed to use (and avoid) while helping Claude learn how to generate higher-quality responses.

The spreadsheet was stored in an open Google Drive folder and remained publicly accessible until Business Insider flagged it.

What the leak revealed

The spreadsheet included more than 120 “whitelisted” sites, such as:

Harvard.edu
Bloomberg
Mayo Clinic
The National Institutes of Health (NIH)

Those were the trusted sources that Surge AI workers could pull from when crafting prompts and answers during Claude’s reinforcement learning from human feedback (RLHF) phase.

But the document also listed more than 50 “blacklisted” sites, places workers were explicitly told to avoid. That list included major publishers and platforms like:

The New York Times
Reddit
The Wall Street Journal
Stanford University
Wiley.com

Why were these sites off-limits? We don't know for sure, but licensing or copyright concerns are the most likely reason. That is particularly notable given Reddit’s recent lawsuit against Anthropic over alleged data misuse.

Why it matters

Although the data was used for fine-tuning (not pre-training), the leak raises serious questions about data governance and legal risk in the AI industry.

Experts warn that courts may not draw a sharp line between pre-training and fine-tuning data when evaluating potential copyright violations.

Surge AI quickly took the document offline after the leak was reported.

Anthropic, meanwhile, told Business Insider it had no knowledge of the list, which was reportedly created independently by its vendor.

Data control in the AI era

This isn’t the first time an AI vendor has mishandled sensitive training materials. Scale AI, another major player in the data-labeling space, has faced a similar leak in the past.

But the stakes are higher now. With Anthropic valued at over $60 billion and Claude emerging as a top competitor to ChatGPT, every misstep invites scrutiny.

This event highlights a growing vulnerability in the AI ecosystem: as companies rely more on human-supervised training, they also depend on third-party firms, and those firms don’t always have airtight security or oversight.

What it means for you

AI users need to understand that the quality, accuracy and even the ethical grounding of their chatbot’s responses are deeply tied to the data it's trained on, and to who decides what goes in or stays out.

This leak reveals that even top-tier models like Claude can be influenced by behind-the-scenes decisions made by third-party vendors.

When those choices involve inconsistent standards or unclear sourcing, they raise serious questions about bias, trust and accountability in the AI we rely on every day.

The takeaway

This leak is a glimpse into how major AI companies shape their models and who is guiding the process.

As AI becomes more embedded in everyday tools, trust will come down to transparency.

On that front, there appears to be a long way to go.

Amanda Caswell is an award-winning journalist, bestselling YA author, and one of today’s leading voices in AI and technology. A celebrated contributor to various news outlets, her sharp insights and relatable storytelling have earned her a loyal readership. Amanda’s work has been recognized with prestigious honors, including outstanding contribution to media.

Known for her ability to bring clarity to even the most complex topics, Amanda seamlessly blends innovation and creativity, inspiring readers to embrace the power of AI and emerging technologies. As a certified prompt engineer, she continues to push the boundaries of how humans and AI can work together.

Beyond her journalism career, Amanda is a bestselling author of science fiction books for young readers, where she channels her passion for storytelling into inspiring the next generation. A long-distance runner and mom of three, Amanda’s writing reflects her authenticity, natural curiosity, and heartfelt connection to everyday life — making her not just a journalist, but a trusted guide in the ever-evolving world of technology.
