ChatGPT Performance Drift – a New Risk for Business
Stanford and UC Berkeley researchers said it becomes “challenging, if not impossible, to reproduce results from the ‘same’ LLM.”
At a Glance
- A new study found ChatGPT's performance fluctuated greatly from March to June, with more mistakes in math and code.
- Both free and paid versions of ChatGPT struggled, while GPT-4 increasingly refused to answer sensitive questions.
- The findings imply that businesses will find it "challenging" to stably integrate language models into their workflow.
ChatGPT’s performance can fluctuate wildly, with mistakes in its output increasing over time, making it "challenging" to integrate the chatbot stably into business workflows, according to a new study by researchers from Stanford University and UC Berkeley.
In the paper, ‘How is ChatGPT's behavior changing over time?’, the researchers sought to uncover whether updates to an AI model that aim to improve some of its abilities end up hurting others.
They chose to measure GPT-3.5 and GPT-4, the two most popular large language models in use, which power the free and paid versions of ChatGPT, respectively.
They found that the performance and behavior of both the free version of OpenAI’s chatbot (powered by GPT-3.5) and the $20-a-month premium version (powered by GPT-4) can “vary greatly over time.”
The study tested the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four tasks: solving math problems, answering sensitive questions, generating code and visual reasoning, which refers to solving problems using graphical representations such as organizing seating arrangements.
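Such a comparison is possible because OpenAI's API exposes dated model snapshots. Below is a minimal sketch of how a side-by-side query could look, assuming the 2023-era snapshot names ("gpt-4-0314" and "gpt-4-0613") and the pre-1.0 openai Python library; it illustrates the approach, not the researchers' actual test harness.

```python
# Sketch: querying two dated snapshots of the same model with an identical
# prompt. Assumes the pre-1.0 openai Python library and an API key in the
# OPENAI_API_KEY environment variable.
import openai

SNAPSHOTS = ["gpt-4-0314", "gpt-4-0613"]  # March vs. June 2023 versions of GPT-4

def ask(model: str, prompt: str) -> str:
    """Send one prompt to a pinned model snapshot and return the reply text."""
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimize sampling noise so differences reflect the model
    )
    return resp["choices"][0]["message"]["content"]

prompt = "Write a Python function that checks whether an integer is prime."
for model in SNAPSHOTS:
    answer = ask(model, prompt)
    print(f"--- {model} ({len(answer)} chars) ---")
    print(answer)
```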
They found that GPT-4’s accuracy on math problems sank from 97.6% in March to just 2.4% in June. GPT-4’s response lengths also dropped by over 90%.
Meanwhile, the free-to-access GPT-3.5 showed better accuracy on the same math problems, rising from 7.4% in March to 86.8% in June.
With code generation, just 10% of GPT-4's June outputs were directly executable, compared to 50% in March. GPT-3.5’s executable outputs also dropped, from 22% in March to just 2%.
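"Directly executable" here means the raw model output can be run as-is, without hand-editing. The sketch below shows one simple way such a check could be implemented; it is illustrative and not taken from the paper. Notably, an answer wrapped in Markdown code fences fails this kind of check even when the enclosed code is correct.

```python
# Sketch: a simple "directly executable" check on raw model output.
# Illustrative only; not the authors' evaluation code.
import ast

def directly_executable(output: str) -> bool:
    """Return True if the raw output parses as valid Python source."""
    try:
        ast.parse(output)
        return True
    except SyntaxError:
        return False

# Identical logic, but only the un-fenced version counts as executable,
# because Markdown fences are not valid Python syntax.
plain = "def add(a, b):\n    return a + b"
fenced = "```python\ndef add(a, b):\n    return a + b\n```"
print(directly_executable(plain))   # True
print(directly_executable(fenced))  # False
```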
Moreover, the premium version of ChatGPT answered far fewer potentially sensitive questions, responding to just 5% of them in June versus 21% in March.
The findings show that the same large language model service can “change substantially in a relatively short amount of time.”
Findings uncover new LLM business risk
“It is currently opaque when and how GPT-3.5 and GPT-4 are updated, and it is unclear how each update affects the behavior of these LLMs,” the researchers wrote.
They concluded that these unknowns make it “challenging to stably integrate LLMs into larger workflows.” If the accuracy or formatting of the LLM’s response to a prompt “suddenly changes, this might break the downstream pipeline. It also makes it challenging, if not impossible, to reproduce results from the ‘same’ LLM.”
As a result, they said there is a need to “continuously evaluate and assess” the behavior of LLMs in production applications.
For companies that rely on large language model services, the researchers recommend implementing monitoring analyses similar to those outlined in their paper.
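As a rough illustration of that recommendation, the sketch below re-runs a small, fixed evaluation set against a production model and flags drift from a recorded baseline; the call_llm client, the EVAL_SET prompts and the 5% tolerance are all hypothetical placeholders, not values from the paper.

```python
# Sketch of continuous LLM monitoring: score a fixed benchmark on a schedule
# and alert when accuracy moves beyond a tolerance. All names and values here
# are hypothetical placeholders.
from typing import Callable

# Hypothetical fixed benchmark of (prompt, expected answer) pairs.
EVAL_SET = [
    ("What is 17 + 25? Answer with the number only.", "42"),
    ("Is 97 a prime number? Answer yes or no.", "yes"),
]

def evaluate(call_llm: Callable[[str], str]) -> float:
    """Fraction of benchmark prompts the model currently answers correctly."""
    correct = sum(
        1 for prompt, expected in EVAL_SET
        if call_llm(prompt).strip().lower() == expected
    )
    return correct / len(EVAL_SET)

def check_drift(call_llm: Callable[[str], str], baseline: float, tol: float = 0.05) -> None:
    """Compare the current score against a recorded baseline and flag drift."""
    score = evaluate(call_llm)
    if abs(score - baseline) > tol:
        # In production this could page an on-call engineer or halt the pipeline.
        print(f"DRIFT: accuracy {score:.2f} vs. baseline {baseline:.2f}")
    else:
        print(f"OK: accuracy {score:.2f} (baseline {baseline:.2f})")
```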
The researchers plan to continue the study over the long term. The evaluation data and model responses are accessible via GitHub.
ChatGPT's evolving abilities over time
Large language model services like ChatGPT are routinely updated over time to improve the service. In the past week, OpenAI added 'Custom Instructions' for personalized outputs and data analysis tools, including the ability to execute code via Code Interpreter.
Several users on social media have bemoaned ChatGPT’s generation abilities, contending the application has gotten worse.
But OpenAI’s Product Vice President Peter Welinder said that GPT-4 did not become “dumber,” suggesting instead that when using the application more often, “you start noticing issues you didn't see before.”