OpenAI: ‘Impossible’ to Train Models Without Copyrighted Content

OpenAI tells the U.K. that cutting-edge AI models cannot just train on 'public domain books and drawings created more than a century ago.'

Ben Wodecki, Jr. Editor

January 11, 2024

At a Glance

OpenAI claims it cannot develop state-of-the-art models without access to copyrighted materials, in comments to the U.K.
The New York Times is suing OpenAI for copyright infringement, saying ChatGPT regurgitates its news stories 'near verbatim.'
OpenAI said the Times is 'not telling the full story.' It called regurgitation a 'bug' that surfaced in response to manipulated prompts.

OpenAI, the maker of ChatGPT, is claiming that it would be “impossible” to train AI models like GPT-4 without using copyrighted materials.

The startup, along with its largest investor Microsoft, is being sued by The New York Times for copyright infringement. The newspaper alleges that ChatGPT was trained on its copyrighted news and now regurgitates “near-verbatim” content, among other accusations.

But OpenAI said that restricting access to data for model training would be harmful to its products, according to comments filed with the U.K. House of Lords' Communications and Digital Select Committee.

“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” the startup said.

Moreover, “limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment but would not provide AI systems that meet the needs of today’s citizens,” OpenAI contended.

The startup further said that it complies with all applicable laws, including copyright, but expressed the belief that copyright law “does not forbid training,” arguing that training falls under fair use.

OpenAI also pointed out that websites can block its GPTBot web crawler from accessing their pages and that it has introduced an opt-out process for creators who want to be excluded from future DALL-E training datasets.
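
For site owners wondering what blocking GPTBot looks like in practice, the sketch below assumes the block is done through a standard robots.txt file. The rules shown and the example.com URLs are illustrative assumptions, not text from OpenAI's filing. The check uses Python's standard-library robots.txt parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: it disallows the GPTBot
# user agent everywhere and leaves all other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

A robots.txt rule is a request that well-behaved crawlers honor, not a technical barrier, and it only affects future crawling rather than data already collected.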

OpenAI said the Times lawsuit was a “surprise and disappointment,” one it learned of by reading the newspaper. Talks with the Times on “real-time display with attribution in ChatGPT” “appeared to be progressing” as of Dec. 19, until the lawsuit was filed on Dec. 27.

OpenAI said it continues to be “actively engaged” with the news media to find mutually agreeable solutions. The startup has licensing deals with Axel Springer, the publisher of Politico and Business Insider, as well as the Associated Press.

“We expect our ongoing negotiations with others to yield additional partnerships soon,” the startup said in submitted comments.

NYT ‘not telling the full story’

In a blog post, OpenAI said the Times is “not telling the full story” in its lawsuit. For example, the Times is claiming that ChatGPT regurgitates its news stories nearly word-for-word. OpenAI said this is a “rare bug” that it is working to fix.

“Memorization is a rare failure of the learning process that we are continually making progress on, but it’s more common when particular content appears more than once in the training data, like if pieces of it appear on lots of different public websites.”

News outlets often cite each other’s stories, typically with attribution to the main source.

Also, OpenAI said the Times has “intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate” near-verbatim content.

Google Brain founder Andrew Ng has made the same point: The prompts used by the Times are not ones people would normally use. Speaking at a fireside chat at CES 2024, he also said the word-for-word regurgitation seems to be a bug.

Finally, OpenAI argued that the Times can simply opt out of training, which it did last August.

“We look forward to continued collaboration with news organizations, helping elevate their ability to produce quality journalism by realizing the transformative potential of AI,” the startup said.

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
