Dive Brief:
- Though OpenAI keeps updating its large language models, their behavior is not necessarily improving over time; in some cases, it has gotten significantly worse, according to research from Stanford and UC Berkeley published Tuesday.
- The largest gaps in LLM behavior were related to solving math problems, researchers found. When asked to identify whether a number is prime, the March version of GPT-4 was nearly 98% accurate, while the June version was accurate only 2% of the time, according to the report. GPT-3.5, by contrast, improved, gaining nearly 80 percentage points in accuracy between March and June.
- “When we release new model versions, our top priority is to make newer models smarter across the board,” OpenAI said in an updated blog post Thursday. “We are targeting improvements on a large number of axes, such as instruction following, factual accuracy and refusal behavior.” The company is extending support for certain GPT-3.5 Turbo and GPT-4 models through at least June 2024 after reviewing feedback from customers.
Dive Insight:
The researchers evaluated the March and June versions of GPT-3.5 and GPT-4 on each model's ability to solve math problems, answer sensitive questions, generate code and perform tasks involving visual reasoning.
“Our goal here is not to provide a holistic assessment but to demonstrate that substantial ChatGPT performance drift exists on simple tasks,” the report said. “We are adding more benchmarks in future evaluations as part of a broader, long-term study of LLM service behavior.”
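As a concrete illustration of the kind of simple task involved, the sketch below scores a model's answers to primality questions against ground truth. It assumes the openai Python client (v1+) and sympy; the prompt wording, model name and sample numbers are illustrative stand-ins, not the paper's exact setup.

```python
# Hedged sketch: score primality answers against ground truth.
# Assumes the openai Python client (v1+) and an OPENAI_API_KEY in the
# environment; prompt wording and test numbers are illustrative.
from openai import OpenAI
from sympy import isprime

client = OpenAI()

def ask_is_prime(model: str, n: int) -> bool:
    """Ask the model whether n is prime; parse a Yes/No answer."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Is {n} a prime number? Answer Yes or No."}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def accuracy(model: str, numbers: list[int]) -> float:
    """Fraction of questions where the model matches sympy's ground truth."""
    correct = sum(ask_is_prime(model, n) == isprime(n) for n in numbers)
    return correct / len(numbers)

print(accuracy("gpt-4", [17077, 20024, 30011, 44497]))
```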
When the models were given 50 code-generation problems from LeetCode’s easy category, the share of GPT-4’s generated code that was directly executable dropped from 52% in March to 10% in June. GPT-3.5’s rate fell from 22% to 2%.
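"Directly executable" here means the generated snippet runs as produced. A minimal sketch of such a check, assuming model output arrives as a plain string that may be wrapped in a markdown fence; the fence-stripping regex and sample snippet are illustrative, not the researchers' actual harness:

```python
# Hedged sketch: check whether a generated snippet is directly executable.
# Real harnesses should sandbox this and also run the problem's test cases.
import re

def strip_fences(text: str) -> str:
    """Remove a ```python ... ``` markdown fence if the model added one."""
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

def is_executable(generated: str) -> bool:
    """True if the snippet runs top to bottom without raising."""
    try:
        exec(strip_fences(generated), {})  # isolated global namespace
        return True
    except Exception:
        return False

sample = "```python\nprint(sum(range(10)))\n```"
print(is_executable(sample))  # True: the fenced snippet parses and runs
```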
“The team is aware of the reported regressions and looking into it,” Logan Kilpatrick, developer relations leader at OpenAI, said in a tweet in response to Matei Zaharia, CTO at Databricks and one of the researchers behind the report.
“It would be cool for research like this to have a public OpenAI evaluation set. That way, as new models come online, we can test against these known regression cases,” Kilpatrick said.
The study comes as pressure mounts on OpenAI and as businesses experiment with the technology in their operations.
“For users or companies who rely on LLM services as a component in their ongoing workflow, we recommend that they should implement similar monitoring analysis as we do here for their applications,” the report said.
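In that spirit, the sketch below shows one shape such monitoring could take: re-run a fixed prompt set on a schedule and log accuracy per model version, so a drop like the ones in the study surfaces quickly. The prompt set, expected answers, model name and log format are illustrative assumptions, and the client usage assumes the openai Python package (v1+).

```python
# Hedged sketch: periodic regression suite for an LLM dependency.
# Append each record to a log and alert when accuracy drops between runs.
import datetime
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Fixed prompts with known answers; extend with cases from your own workflow.
CASES = [
    {"prompt": "Is 17077 a prime number? Answer Yes or No.", "expected": "yes"},
    {"prompt": "What is 7 * 8? Answer with the number only.", "expected": "56"},
]

def run_suite(model: str) -> float:
    """Return the fraction of fixed cases the model still answers correctly."""
    hits = 0
    for case in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,
        )
        answer = resp.choices[0].message.content.strip().lower()
        hits += answer.startswith(case["expected"])
    return hits / len(CASES)

record = {
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "model": "gpt-4",
    "accuracy": run_suite("gpt-4"),
}
print(json.dumps(record))
```

Pinning a dated model snapshot and re-running such a suite before switching versions is a lower-effort variant of the same idea.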
OpenAI is one of seven AI companies that committed to the White House’s AI evaluation process, meant to ensure systems are built securely and to increase transparency about model behavior. As part of the process, OpenAI pledged to facilitate third-party discovery and reporting of vulnerabilities in its AI systems.
“Some issues may persist even after an AI system is released and a robust reporting mechanism enables them to be found and fixed quickly,” the White House said in the announcement Friday.