OpenAI Quietly Unveils Web Crawler to Scrape Data for Its AI Models

Separately, OpenAI said it could open up the dataset used to train DALL-E

Ben Wodecki, Jr. Editor

August 8, 2023


At a Glance

  • OpenAI unveils web crawler dubbed ‘GPTBot’ to gather data from websites but says it won't grab personal information.
  • Reports also emerge that the maker of ChatGPT supports licensing of AI systems more powerful than GPT-4.

OpenAI has quietly unveiled a web crawler to sift through the internet in search of data to power its AI models.

Tucked away on its API site was news about GPTBot, a web crawler or spider bot used to visit web pages. Search engines use them to index pages; OpenAI said the tool will be used to “improve future models.”

OpenAI said GPTBot filters out sites that sit behind a paywall and sources that violate its policies. The web crawler also will not gather personally identifiable information (PII), such as a person’s full name, social security number or bank account number.

The company said allowing access can “help AI models become more accurate and improve their general capabilities and safety.”

If you do not want GPTBot to crawl your website, you will need to block it in your site’s robots.txt file by adding the following two lines:

    User-agent: GPTBot
    Disallow: /

Platform owners can also specify parts of their site the bot can and cannot access.
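To see how such partial rules behave before publishing them, a site owner could test a draft robots.txt locally. The sketch below uses Python's standard-library `urllib.robotparser`; the `/blog/` and `/private/` paths, the `example.com` URLs and the `gptbot_allowed` helper are hypothetical, chosen only for illustration. It shows how a generic robots.txt parser reads the rules and is not a statement of how GPTBot itself interprets them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: let GPTBot read the blog, keep it out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /private/
"""


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given rules permit the GPTBot user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for path in ("/blog/post-1", "/private/report"):
        url = "https://example.com" + path
        print(path, gptbot_allowed(ROBOTS_TXT, url))
```

Replacing the rules with the two-line `Disallow: /` version makes the same check return False for every path on the site.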

The announcement of a web crawler comes as the U.S. Federal Trade Commission investigates how OpenAI obtained the data used to build its AI models. A civil investigative demand issued in July seeks detailed information on OpenAI’s datasets, including how much was obtained from publicly available websites.


OpenAI may open up DALL-E's training dataset

Separately, an internal OpenAI memo obtained by Bloomberg states that the company would be willing to open up about the data it used to train its image generator tool, DALL-E.

Such a move could have major repercussions, especially considering the furor around potential copyright infringement in the training of models like DALL-E. OpenAI is listed among the defendants in a class action lawsuit brought by artists over alleged infringement.

OpenAI also said internally that it supports the idea of governments issuing licenses to those seeking to develop foundation models. The licensing system would be co-designed by major players in the AI space, like OpenAI, alongside lawmakers. The licensing requirement would only cover AI models more powerful than OpenAI’s flagship GPT-4.

Anna Makanju, OpenAI’s global affairs vice president, told Bloomberg that the company is not “pushing” for licenses but believes licensing is a “realistic” way to monitor emerging applications.

The proposal mirrors comments made by OpenAI CEO Sam Altman during his U.S. Senate testimony in May. He supported the idea of an agency that would be in charge of licenses for AI products, with the power to revoke the privileges of entities that violate prospective rules on AI.


About the Author

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
