OpenAI tells the U.K. that cutting-edge AI models cannot just train on 'public domain books and drawings created more than a century ago.'
OpenAI, the maker of ChatGPT, is claiming that it would be “impossible” to train AI models like GPT-4 without using copyrighted materials.
The startup, along with its largest investor Microsoft, is being sued by The New York Times for copyright infringement. The newspaper alleges that ChatGPT was trained on its copyrighted news and then regurgitates “near-verbatim” content, among other accusations.
But OpenAI said that restricting access to data for model training would be harmful to its products, according to comments filed with the U.K. House of Lords' Communications and Digital Select Committee.
“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” the startup said.
Moreover, “limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment but would not provide AI systems that meet the needs of today’s citizens,” OpenAI contended.
The startup further said that it complies with all applicable laws, including copyright, but expressed the belief that copyright law “does not forbid training,” arguing that training falls under fair use.
OpenAI also pointed out that websites can prevent its GPTBot web crawler from accessing a site and that it also has introduced an opt-out process for creators who want to be excluded from future DALL-E training datasets.
OpenAI said the Times lawsuit was a “surprise and disappointment,” and that it learned of the suit by reading the newspaper. Talks with the Times “appeared to be progressing” as of Dec. 19 on “real-time display with attribution in ChatGPT” until the Dec. 27 lawsuit.
OpenAI said it continues to be “actively engaged” with the news media to find mutually agreeable solutions. The startup has licensing deals with Axel Springer, the publisher of Politico and Business Insider, as well as the Associated Press.
“We expect our ongoing negotiations with others to yield additional partnerships soon,” the startup said in submitted comments.
In a blog post, OpenAI said the Times is “not telling the full story” in its lawsuit. For example, the Times is claiming that ChatGPT regurgitates its news stories nearly word-for-word. OpenAI said this is a “rare bug” that it is working to fix.
“Memorization is a rare failure of the learning process that we are continually making progress on, but it’s more common when particular content appears more than once in the training data, like if pieces of it appear on lots of different public websites.”
News outlets often cite each other’s stories, typically with attribution to the main source.
Also, OpenAI said the Times has “intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate” near-verbatim content.
Google Brain founder Andrew Ng has pointed out the same thing: Prompts used by the Times are not ones people would normally use. He also said, during a fireside chat at CES 2024, that the word-for-word regurgitation seems to be a bug.
Finally, OpenAI argued that the Times can simply opt out of training, which it did last August.
“We look forward to continued collaboration with news organizations, helping elevate their ability to produce quality journalism by realizing the transformative potential of AI,” the startup said.