Google to Train AI Models on Reddit Posts. What Could Go Wrong?

Posts are infamously candid and can be offensive. Redditors think Google will have to train its models more carefully

February 23, 2024

At a Glance

Nearly a year after deciding to charge for API access, Reddit has reportedly inked a $60 million-a-year deal with Google.
Reddit posts will be used to train Google AI models. But the content is infamously candid and can be offensive.

Reddit, the community discussion platform, is one of the more bizarre places on the internet, playing home to user posts on everything from mild memes to wild conspiracy theories.

But user posts may soon be used by Google to train its AI models, as the search giant has penned a data-training deal with Reddit. Reuters reported that Google signed an agreement worth $60 million per year to gain access to Reddit's user content. The posts will be used to train Google's AI models.

Neither Google nor Reddit has commented publicly on the deal, but Reddit CEO Steve Huffman previously told The New York Times that the platform’s data corpus is “really valuable, but we don't need to give all of that value to some of the largest companies in the world for free.”

On Reddit, users retain ownership rights to their content, but Reddit can license use of that content to customers such as Google. In response to the news, Redditors have begun posting complete gibberish in a bid to confuse AI systems with useless information.

What does this mean?

For Google, the deal provides another data source to further power its growing army of AI models. It unveiled a family of small open source models called Gemma just last week.

For Reddit, it provides another source of income ahead of its long-awaited IPO amid a slump in advertising revenue as competition heats up from social media newcomers like TikTok.

Last year, Reddit announced that it would charge for access to its API. Access used to be free and enabled users to create accessibility applications. Moderators of subreddits (mini-communities dedicated to certain topics) also used it to create tools.

What could possibly go wrong?

On the whole, Reddit is home to a wide range of neutral user content spanning everything from gaming to recipes. But it is also known for its candor and Google could be training on data that is NSFW (Not Safe For Work) or outright offensive.

While Google’s AI developers will likely employ methods to avoid potentially perilous content, some posts could slip through the cracks.

Reddit users quickly picked up on this, with many saying in the r/Google subreddit that models will need to be trained to be safe and not toxic.

Some users jokingly likened future outputs to those of r/SubredditSimulator, a fully automated subreddit that generates random submissions and comments based on prior user content.

AI Business has reached out to Google and Reddit for comment.

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
