'75% of web pages are AI-generated' — AI CEO explains why companies are desperate for 'real' human data

It’s no longer surprising that AI companies rely on real human data to train and improve their models — but just how much of it they use might be.

From tech giants to everyday apps, the demand for human-generated data is exploding. AI developers like OpenAI aren’t alone: businesses outside the AI space, including DoorDash, are also tapping into real-world user data to refine their systems and stay competitive.

Real human data is becoming one of the most valuable assets in AI

The demand for real-world video, in particular, is surging. According to Troveo CEO Marty Pesis, AI models need more than synthetic inputs to truly understand how people behave.

“The demand for real-world video is accelerating because AI companies need grounded examples of how people actually move, behave, and interact in real environments,” he said. “Simulated and synthetic data don’t fully capture the unpredictability of real life.”

That push is already showing up in how companies collect data. DoorDash recently introduced an optional program called “DoorDash Tasks,” which pays delivery drivers to record themselves completing everyday activities. The goal is simple: give AI a better understanding of the physical world through real human behavior.

But as more companies turn to human-generated data, consent is becoming a bigger part of the conversation.

“Consent is central for two reasons,” Pesis explained. “Companies need to know they have the legal right to use the data for AI training, and they need confidence it actually came from real people.”

That second point is becoming increasingly important as AI-generated content floods the internet. Some estimates suggest nearly 75% of newly created web pages now include AI-generated material — a number that continues to rise.

So what makes human data truly valuable?

According to Pesis, it comes down to quality. “High-value training data is accurately labeled, technically consistent, and representative,” he said. In practice, that means data needs to be standardized so it can scale — and diverse enough to reflect real-world conditions, from lighting and camera angles to the many ways people actually move and interact.

Anthropic, Apple and Superhuman (formerly Grammarly) stand out among the many companies that use the text, audio and video data produced by their human users to train AI models.

It’s easy to predict that more of the companies whose products we use regularly will join that trend. The biggest worry is that they will do so without our consent. Here’s hoping we’ll be able to opt out of those practices as they become more common.

Elton Jones covers AI for Tom’s Guide and tests all the latest models, from ChatGPT to Gemini to Claude, to see which tools perform best and how they can improve everyday productivity.

He is also an experienced tech writer who has covered video games, mobile devices, headsets, and now artificial intelligence for over a decade. Since 2011, his work has appeared in publications including The Christian Post, Complex, TechRadar, Heavy, and ONE37pm, with a focus on clear, practical analysis.

Today, Elton focuses on making AI more accessible by breaking down complex topics into useful, easy-to-understand insights for a wide range of readers.
