Reddit Blocks AI Crawlers, Protects Data From Free Access

Reddit takes steps to protect its user data by requiring AI companies to pay for access

Ben Wodecki, Jr. Editor

June 27, 2024


Reddit has taken steps to protect its valuable user-generated content from AI companies' web crawlers, updating its backend to restrict access to the platform's data.

The social network announced it would update its robots.txt file, the file that implements the Robots Exclusion Protocol, to prevent external sources from crawling information from the site.
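As a rough illustration of how a well-behaved crawler consults robots.txt before fetching a page, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules, the `ExampleArchiveBot` name and the example.com URL are assumptions made up for this example; they are not Reddit's actual file or policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Reddit's actual robots.txt.
# One named crawler is allowed everywhere, every other crawler is blocked.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/r/some-community/"  # placeholder URL
    for agent in ("GPTBot", "ExampleArchiveBot"):
        print(agent, is_allowed(EXAMPLE_ROBOTS_TXT, agent, page))
```

Under these made-up rules, a crawler identifying itself as GPTBot falls through to the catch-all `*` entry and is refused, while the named archive bot is allowed. Compliance with robots.txt is voluntary: the file states the site's wishes, and it is up to each crawler to check and honor them.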

Web crawlers like OpenAI’s GPTBot scrape thousands of pages across the internet, running nonstop for anywhere from a couple of days to a few weeks. AI model makers use them to gather data, often without the platform owner’s permission.

Web crawling is increasingly frowned upon as rightsholders grow more protective of their content.

Reddit’s decision to block crawlers comes as it’s looking to protect a lucrative asset: data.

The platform has struck deals with AI developers including Google and OpenAI, providing them access to a trove of user posts in exchange for cash.

Reddit’s Google deal was worth a reported $60 million per year.

Reddit generated $810 million in 2023, primarily from advertising, according to Business of Apps. However, Reddit has recently looked for other ways to generate money, including charging third parties for access to its API, a move met with rage by users in June 2023.

By restricting crawlers from scraping the platform, Reddit would force AI developers who want to train their models on its content to pay for a license.

“We are selective about who we work with and trust with large-scale access to Reddit content,” according to a company announcement. “Anyone accessing Reddit content must abide by our policies, including those in place to protect Redditors.”

There are some non-commercial exemptions, however, allowing researchers and archival organizations such as the Internet Archive to access Reddit content.

“The Internet Archive is grateful that Reddit appreciates the importance of helping to ensure the digital records of our times are archived and preserved for future generations to enjoy and learn from,” said Mark Graham, director of the Internet Archive’s Wayback Machine. “Working in collaboration with Reddit we will continue to record and make available archives of Reddit, along with the hundreds of millions of URLs from other sites we archive every day.”

Google’s use of Reddit content has not gone entirely smoothly. Its AI-powered search feature, AI Overviews, had to be revamped after it answered user queries with absurd responses drawn from Reddit, such as suggesting jumping off the Golden Gate Bridge as a cure for depression.

