Reddit Blocks AI Crawlers, Protects Data From Free Access
Reddit takes steps to protect its user data by requiring AI companies to pay for access
Reddit has taken steps to protect its valuable user-generated content from AI companies' web crawlers, updating its backend to restrict access to the platform's data.
The social platform announced it would update its Robots Exclusion Protocol (robots.txt file) to block unauthorized crawlers from scraping the site's data.
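A robots.txt file is a plain-text file served at a site's root that tells automated agents which paths they may crawl. A minimal sketch of a blanket-disallow policy in the spirit of Reddit's change (illustrative only, not Reddit's actual file; the archival user agent named here is hypothetical) looks like this:

```
# Disallow all crawlers by default
User-agent: *
Disallow: /

# Hypothetical exemption for an approved archival crawler
User-agent: example-archive-bot
Allow: /
```

Note that robots.txt is advisory: it relies on crawlers voluntarily honoring it, which is why Reddit paired the update with rate limiting and access policies.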
Web crawlers like OpenAI’s GPTBot scrape thousands of pages across the internet, gathering data continuously over stretches ranging from a couple of days to a few weeks. AI model makers use them to collect training data, often without the platform owner’s permission.
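A well-behaved crawler is expected to check a site's robots.txt before fetching pages. A minimal sketch of that check using Python's standard-library `urllib.robotparser`, against a hypothetical blanket-disallow file (not Reddit's actual policy), might look like this:

```python
from urllib import robotparser

# Hypothetical robots.txt resembling a blanket-disallow policy
# (illustrative only; not the platform's actual file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Parse the policy and ask whether a given user agent may fetch a URL.
rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler such as GPTBot performs this check before scraping.
allowed = rp.can_fetch("GPTBot", "https://www.reddit.com/r/all/")
print(allowed)  # False: the blanket rule disallows all user agents
```

The check is purely cooperative; a crawler that ignores it can still fetch the pages, which is why robots.txt alone does not enforce access.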
The practice of web crawling is becoming increasingly frowned upon as rightsholders become protective of their content.
Reddit’s decision to block crawlers comes as it’s looking to protect a lucrative asset: data.
The platform has struck deals with AI developers including Google and OpenAI, providing them access to a trove of user posts in exchange for cash.
Reddit’s Google deal was worth a reported $60 million per year.
Reddit generated $810 million in revenue in 2023, primarily from advertising, according to Business of Apps. The company has also sought other ways to make money, including charging third parties for access to its API, a move that sparked widespread user backlash last June.
With crawlers blocked from scraping the platform, AI developers that want to train their models on Reddit content must pay for a license.
“We are selective about who we work with and trust with large-scale access to Reddit content,” according to a company announcement. “Anyone accessing Reddit content must abide by our policies, including those in place to protect Redditors.”
There are some non-commercial exemptions, however, allowing researchers and archival organizations such as the Internet Archive to access Reddit content.
“The Internet Archive is grateful that Reddit appreciates the importance of helping to ensure the digital records of our times are archived and preserved for future generations to enjoy and learn from,” said Mark Graham, director of the Internet Archive’s Wayback Machine. “Working in collaboration with Reddit we will continue to record and make available archives of Reddit, along with the hundreds of millions of URLs from other sites we archive every day.”
Google’s use of Reddit content hasn’t gone entirely smoothly: its AI-powered search feature, AI Overviews, had to be revamped after it surfaced absurd Reddit-sourced answers to user queries, such as suggesting jumping off the Golden Gate Bridge as a cure for depression.