Google strikes $60m deal with Reddit for AI training data — what you need to know

Google headquarters in California
(Image credit: Achinthamb/Shutterstock)

Reddit spent the latter half of 2023 considering whether to block the Google and Bing search engines from indexing posts on the site. The aim, according to The Washington Post, was to prevent the unauthorized and uncompensated use of its posts to train AI.

Now Reddit has announced it's reached a deal with Google that will, among other things, give the search giant access to the Reddit Data API “to improve its products and services”, which includes “more efficient ways to train models”. In Google’s words, access to the API will provide “real-time, structured, unique content from their large and dynamic platform.”

The deal, which Bloomberg previously suggested would be “worth about $60 million on an annualized basis”, doesn’t stop there. As part of the agreement, Reddit will get access to Google’s Vertex AI service, which should improve its internal search results, and the deal will also allow for “Reddit content to be displayed across Google products.”

Google says this will ensure “more content-forward displays of Reddit information that will make our products more helpful for our users and make it easier to participate in Reddit communities and conversations.” Given the number of people who append the word “reddit” to searches to surface genuine user-generated insights, that could be a very good thing for the average Google user.

But for Google, the real prize is undoubtedly the vast treasure trove of training data, which will theoretically make its generative AI appear more human, thanks to the posts and comments written by millions of real people every day.

But scale isn’t everything, and in some ways Reddit is an imperfect sample for training artificial intelligence when compared to literature or magazines. Grammar is faster and looser, there are a lot of memes and inside jokes, it’s full of information that’s just plain wrong, and its user base is predominantly male.

Reddit logo and Reddit logo on phone
(Image credit: Shutterstock)

By contrast, Apple has reportedly sought multimillion-dollar deals with publishers in order to train on their more formal and factually accurate magazines and newspapers. That approach has its disadvantages too: it concentrates on another small part of the human experience at the expense of how everyday people communicate, something Reddit is undoubtedly better at demonstrating.

Expect more such deals to be made public over the next few years, because people are realizing that AI means big money and that training data can’t be absorbed free of charge without consequences. In the last year, OpenAI, Meta and Stability AI have all been hit by lawsuits from authors who claim that their books were used for training without permission or compensation.

Alan Martin

Freelance contributor Alan has been writing about tech for over a decade, covering phones, drones and everything in between. Previously Deputy Editor of tech site Alphr, his words are found all over the web and in the occasional magazine too. When not weighing up the pros and cons of the latest smartwatch, you'll probably find him tackling his ever-growing games backlog. Or, more likely, playing Spelunky for the millionth time.

  • slightnitpick
    But scale isn’t everything, and in some ways Reddit is an imperfect sample for training artificial intelligence when compared to literature or magazines. Grammar is faster and looser, there’s a lot of memes and inside jokes, it’s full of information that’s just plain wrong and it's predominantly male.
    Not to mention the ad hoc moderation decisions. For interactions between people this moderation already makes what people say artificial.

    Will Google have access to deleted posts and comments as well? That would be the only saving grace, and even it has its limits.