Claude AI training leak reveals trusted and banned websites — here’s what it means for you
Internal leak reveals Claude’s hidden data sources

A leaked internal document has exposed the data sources used to fine-tune Claude, Anthropic’s AI assistant, and it’s prompting new concerns about how today’s most powerful models are being shaped behind the scenes.
The document, reportedly created by third-party data-labeling firm Surge AI, included a list of websites that gig workers were instructed to use (and avoid) while helping Claude learn how to generate higher-quality responses.
The spreadsheet was stored in an open Google Drive folder and remained publicly accessible until Business Insider flagged it.
What the leak revealed
The spreadsheet included more than 120 “whitelisted” sites, such as:
- Harvard.edu
- Bloomberg
- Mayo Clinic
- The National Institutes of Health (NIH)
Those were the trusted sources that Surge AI workers could pull from when crafting prompts and answers during Claude’s reinforcement learning from human feedback (RLHF) phase.
But the document also listed 50+ “blacklisted” sites, places workers were explicitly told to avoid. That list included major publishers and platforms like:
- The New York Times
- The Wall Street Journal
- Stanford University
- Wiley.com
Why were these sites off-limits? While we don't know for sure, it's most likely due to licensing or copyright concerns, which are especially relevant given Reddit’s recent lawsuit against Anthropic over alleged data misuse.
Why it matters
Although the data was used for fine-tuning (not pre-training), the leak raises serious questions about data governance and legal risk in the AI industry.
Experts warn that courts may not draw a sharp line between training and fine-tuning data when evaluating potential copyright violations.
Surge AI quickly took the document offline after the leak was reported.
Anthropic, meanwhile, told Business Insider it had no knowledge of the list, which was reportedly created independently by its vendor.
Data control in the AI era
This isn’t the first time an AI vendor has mishandled sensitive training materials. Scale AI, another major player in the data-labeling space, previously faced a similar leak.
But the stakes are higher now. With Anthropic valued at over $60 billion and Claude emerging as a top competitor to ChatGPT, every misstep invites scrutiny.
This event highlights a growing vulnerability in the AI ecosystem: as companies rely more on human-supervised training, they also depend on third-party firms, and those firms don’t always have airtight security or oversight.
What it means for you
AI users need to understand that the quality, accuracy and even the ethical grounding of their chatbot’s responses are deeply tied to the data it's trained on and to who decides what goes in or stays out.
This leak reveals that even top-tier models like Claude can be influenced by behind-the-scenes decisions made by third-party vendors.
When those choices involve inconsistent standards or unclear sourcing, they raise serious questions about bias, trust and accountability in the AI we rely on every day.
The takeaway
This leak is a glimpse into how major AI companies shape their models and into who is guiding the process.
As AI becomes more embedded in everyday tools, trust will come down to transparency.
On that front, it appears that there’s still a long way to go.

Amanda Caswell is an award-winning journalist, bestselling YA author, and one of today’s leading voices in AI and technology. A celebrated contributor to various news outlets, her sharp insights and relatable storytelling have earned her a loyal readership. Amanda’s work has been recognized with prestigious honors, including outstanding contribution to media.
Known for her ability to bring clarity to even the most complex topics, Amanda seamlessly blends innovation and creativity, inspiring readers to embrace the power of AI and emerging technologies. As a certified prompt engineer, she continues to push the boundaries of how humans and AI can work together.
Beyond her journalism career, Amanda is a bestselling author of science fiction books for young readers, where she channels her passion for storytelling into inspiring the next generation. A long-distance runner and mom of three, Amanda’s writing reflects her authenticity, natural curiosity, and heartfelt connection to everyday life — making her not just a journalist, but a trusted guide in the ever-evolving world of technology.