ChatGPT Performance Drift – a New Risk for Business

Stanford and UC Berkeley researchers said it becomes “challenging, if not impossible, to reproduce results from the ‘same’ LLM.”


At a Glance

  • A new study found ChatGPT's performance fluctuated greatly from March to June, with more mistakes in math and code.
  • Both free and paid versions of ChatGPT struggled while GPT-4 increasingly refused to answer sensitive questions.
  • The findings imply that businesses will find it "challenging" to stably integrate language models into their workflow.

ChatGPT’s performance can fluctuate wildly, with mistakes in its output increasing over time, making it "challenging" to integrate the chatbot stably into business workflows, according to a new study by researchers from Stanford University and UC Berkeley.

In the paper, ‘How is ChatGPT's behavior changing over time?’, scholars sought to uncover whether updates to an AI model aimed at improving aspects of the application ended up hurting its generation abilities.

They chose to measure GPT-3.5 and GPT-4, the two most popular large language models in use, which power the free and paid versions of ChatGPT, respectively.

They found that the performance and behavior of both the free version of OpenAI’s chatbot (powered by GPT-3.5) and the $20-a-month premium version (powered by GPT-4) can “vary greatly over time.”

The study tested the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four tasks: solving math problems, answering sensitive questions, generating code and visual reasoning. Visual reasoning refers to solving problems by using graphical representations, such as organizing seating arrangements.

They found that GPT-4’s accuracy on math problems sank from 97.6% in March to just 2.4% in June. GPT-4’s response lengths also dropped by over 90%.

Meanwhile, the free-to-access GPT-3.5 became more accurate at math problems, rising from 7.4% in March to 86.8% in June.


In code generation, just 10% of GPT-4's June outputs were directly executable, compared to 50% in March. The share of GPT-3.5’s outputs that were directly executable also dropped, from 22% in March to just 2% in June.

Moreover, the premium version of ChatGPT answered far fewer potentially sensitive questions: its response rate fell from 21% in March to just 5% in June.

The findings show that the same large language model service can “change substantially in a relatively short amount of time.”

Findings uncover new LLM business risk

“It is currently opaque when and how GPT-3.5 and GPT-4 are updated, and it is unclear how each update affects the behavior of these LLMs,” the researchers wrote.

“These unknowns [make] it challenging to stably integrate LLMs into larger workflows,” they concluded. If the LLM’s response to a prompt in terms of accuracy and formatting “suddenly changes, this might break the downstream pipeline. It also makes it challenging, if not impossible, to reproduce results from the ‘same’ LLM.”

As a result, they said there is a need to “continuously evaluate and assess” the behavior of LLMs in production applications.

The researchers recommend that companies relying on large language model services implement monitoring analyses similar to those outlined in their paper.
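As a rough illustration of what such monitoring could look like, the Python sketch below re-runs a fixed set of prompts against two snapshots of a model and flags a large change in accuracy. It is not the researchers' code: the names `accuracy`, `has_drifted`, `snapshot_a` and `snapshot_b`, the four test prompts and the 5-point tolerance are all assumptions made for this example, and the two "snapshots" are hard-coded stand-ins rather than calls to a real API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types for this sketch: a "model" is any callable that
# maps a prompt string to a reply string.
Model = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected answer)

# A small fixed benchmark that is re-run unchanged on every check.
CASES = [
    ("Is 7 a prime number? Answer yes or no.", "yes"),
    ("Is 9 a prime number? Answer yes or no.", "no"),
    ("Is 13 a prime number? Answer yes or no.", "yes"),
    ("Is 15 a prime number? Answer yes or no.", "no"),
]


def accuracy(model: Model, cases: Sequence[Case]) -> float:
    """Fraction of cases where the reply equals the expected answer,
    ignoring case and surrounding whitespace."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(
        model(prompt).strip().lower() == expected.lower()
        for prompt, expected in cases
    )
    return hits / len(cases)


def has_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when accuracy moved by more than `tolerance` in either direction."""
    return abs(current - baseline) > tolerance


# Stand-ins for two dated snapshots of the "same" service (assumptions,
# not real model calls). In practice each would wrap an API request.
_ANSWERS = dict(CASES)


def snapshot_a(prompt: str) -> str:
    return _ANSWERS[prompt].capitalize()  # always answers correctly


def snapshot_b(prompt: str) -> str:
    return "no"  # answers "no" to everything


if __name__ == "__main__":
    before = accuracy(snapshot_a, CASES)
    after = accuracy(snapshot_b, CASES)
    if has_drifted(before, after):
        print(f"Behavior changed: accuracy {before:.0%} -> {after:.0%}")
```

A check like this compares in both directions on purpose, since the study reported accuracy rising for one model and falling for the other on the same task. A production version would also need to track formatting changes, which this sketch does not attempt.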


The researchers plan to continue the study over the long term. The evaluation data and responses are accessible via GitHub.

ChatGPT's evolving abilities over time

Large language model services like ChatGPT can be routinely updated over time to improve the service. In the past week, OpenAI has added 'Custom Instructions' for personalized outputs and data analysis tools including the ability to execute code via Code Interpreter.

Several users on social media have been bemoaning ChatGPT’s generation abilities, contending the application had gotten worse.

However, OpenAI’s Product Vice President Peter Welinder said that GPT-4 did not become “dumber,” suggesting instead that when using the application more often “you start noticing issues you didn't see before.”


About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

Deborah Yao


Deborah Yao runs the day-to-day operations of AI Business. She is a Stanford grad who has worked at Amazon, Wharton School and Associated Press.
