ChatGPT appears to have taken an early holiday vacation, with social media users reporting the model is generating simplified answers to queries and even refusing to complete some tasks.
"There has been discussion if GPT-4 has become 'lazy' recently," said Ethan Mollick, associate professor at the Wharton School of the University of Pennsylvania, in a post to X, the platform formerly known as Twitter, last month. "My anecdotal testing suggests it may be true."
Mollick said the system still knew how to carry out tasks, but prompted users to do the work.
OpenAI acknowledged the claims of a lazier GPT-4 earlier this month, but the company said the changes in model behavior were neither intentional nor easily explained.
“Training chat models is not a clean industrial process,” the company said in a post from the official ChatGPT X account. “Different training runs even using the same datasets can produce models that are noticeably different in personality, writing style, refusal behavior, evaluation performance and even political bias.”
While enterprises forge paths to generative AI adoption, drawn by hopes of big efficiency gains, model behavior drift presents a roadblock to relying on the tech long term. Unexpected behavior changes could impact customer interactions or operations, depending on what guardrails companies institute.
The leading AI startup said it had not updated the model since Nov. 11 and described the process of updating it as an "artisanal multi-person effort." The company did not respond to requests from CIO Dive for comment. Microsoft, which offers access to OpenAI’s models through Azure OpenAI Service, declined to comment.
OpenAI said differences in model behavior can be subtle with only a subset of prompts degraded, making it difficult to detect and fix these patterns, according to a separate post.
Large language models are often referred to as a black box, illustrating the unknown nature of how these complex systems work. There’s not a clear answer to why large language models might act inconsistently, but researchers say these drifts can impact enterprise deployments.
“These kinds of behaviors can be a major barrier to reliable deployments of these large language models,” James Zou, assistant professor of biomedical data science at Stanford University, told CIO Dive. “If you’ve got a large language model as part of your software or data science stack and the model suddenly gets lazy or has changes to formatting, behavior and outputs, this could actually really break the rest of your pipeline.”
Unexpected changes and gaming the system
Reports of ChatGPT’s laziness are on par with the unintended changes in model behavior that researchers from Stanford and University of California, Berkeley, spotted in July, according to Zou, who was an author of the report. However, the emerging winter break thesis is somewhat different.
Researchers found the behavior of GPT-3.5 and GPT-4 was getting significantly worse over time in some cases. The winter break hypothesis, by contrast, centers on getting higher-quality outputs by asking models to assume a certain persona.
The hypothesis was presented as a way to explain ChatGPT’s laziness. Rob Lynch, head of product at legal data-as-a-service provider UniCourt, found that OpenAI’s GPT-4 Turbo model produces longer completions when prompted to believe it is currently May rather than December.
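As a rough illustration of how such a comparison could be run, the sketch below holds the task constant, varies only the stated date and compares average completion lengths. It assumes the OpenAI Python SDK and uses an illustrative model name and prompt; these are not details from Lynch’s test.

```python
# Sketch: compare completion lengths when the model is told it is May vs. December.
# Assumes the OpenAI Python SDK (openai>=1.0) with OPENAI_API_KEY set in the environment;
# the model name, prompt and sample size are illustrative assumptions.
from statistics import mean
from openai import OpenAI

client = OpenAI()
TASK = "Write a Python function that parses an ISO 8601 date string and explain each step."

def completion_lengths(stated_date: str, samples: int = 20) -> list[int]:
    """Return the character length of each completion when the system message asserts a date."""
    lengths = []
    for _ in range(samples):
        response = client.chat.completions.create(
            model="gpt-4-turbo-preview",  # assumed model name
            messages=[
                {"role": "system", "content": f"The current date is {stated_date}."},
                {"role": "user", "content": TASK},
            ],
        )
        lengths.append(len(response.choices[0].message.content))
    return lengths

may = completion_lengths("May 15, 2023")
december = completion_lengths("December 15, 2023")
print(f"Mean length with May prompt: {mean(may):.0f} characters")
print(f"Mean length with December prompt: {mean(december):.0f} characters")
```

Because individual completions vary widely, many samples per condition are needed before a length difference means anything.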
While some users had issues reproducing Lynch’s study, the theory that users can benefit from prompting the model with a different persona still stands, according to researchers.
“This is because, for any given topic, there is text on the web about it from a variety of perspectives — for example, there may be physics professors writing about physics, and grade school students asking questions about physics,” Matei Zaharia, CTO at Databricks and an associate professor of computer science at UC Berkeley, said in an email. “This persona approach asks the LLM to match the text from the more advanced persona and can lead to measurably better results.”
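In practice, the persona approach is little more than an extra instruction. A minimal sketch, again assuming the OpenAI Python SDK and an illustrative model name, asks the model to answer as the more advanced persona:

```python
# Sketch: persona prompting via the system message.
# Assumes the OpenAI Python SDK; the model name and persona text are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",  # assumed model name
    messages=[
        # The persona instruction steers the model toward more advanced source text.
        {"role": "system", "content": "You are a physics professor writing lecture notes for graduate students."},
        {"role": "user", "content": "Explain why the sky appears blue."},
    ],
)
print(response.choices[0].message.content)
```

Running the same user question without the system message gives a baseline for judging whether the persona actually changes the answer.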
Prompting is quickly becoming a must-have skill for employees using generative AI at work. Yet only 1 in 5 employees is confident in their ability to write meaningful prompts, according to Coda data. How employees craft prompts can have an outsized impact on how effectively they use generative AI tools.
Zaharia, another author of the July report on LLM behavior, said the findings related to using a different month to get a better response were “a little surprising,” but it’s possible that the text used as training material was less detailed in December than in May, causing the winter break behavior.
“However, the difference in performance seen with 'winter break' is still also small overall, so it could well just be due to chance,” Zaharia said.
Enterprise leaders can gain an advantage by tailoring queries with specific instructions, but large language models are very receptive to any additional guidance – for better or worse. One user found that prompting the model by saying it would get a tip resulted in longer responses, while an Apollo Research report found large language models can strategically deceive users when put under pressure.
The challenge for tech leaders is managing the potential uncertainty and risk, Zou said.
“It is extremely important to have a pipeline in place in their companies to continuously evaluate and monitor the behaviors and drifts of these larger models," Zou said. "They should also implement best practices in making the rest of their software pipelines robust to potential changes.”
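A lightweight version of that kind of monitoring, sketched below on the assumption that the application already has a function wrapping its model calls, runs a fixed prompt set on a schedule and flags large shifts from a stored baseline. The metrics, thresholds and file layout are illustrative, not a prescribed design.

```python
# Sketch: a minimal drift check for an LLM-backed pipeline.
# Assumes the application already exposes a function that sends a prompt to its
# chosen model and returns text; metrics and thresholds here are illustrative.
import json
from pathlib import Path
from typing import Callable

BASELINE_FILE = Path("llm_baseline.json")
PROMPTS = [
    "Summarize the following contract clause: ...",
    "Extract the invoice total from this text: ...",
]

def collect_metrics(ask_model: Callable[[str], str]) -> dict:
    """Run the fixed prompt set and compute simple output statistics."""
    outputs = [ask_model(p) for p in PROMPTS]
    return {
        "mean_length": sum(len(o) for o in outputs) / len(outputs),
        "refusal_rate": sum(("I can't" in o or "I cannot" in o) for o in outputs) / len(outputs),
    }

def check_drift(ask_model: Callable[[str], str], tolerance: float = 0.25) -> list[str]:
    """Compare current metrics against the stored baseline and report large shifts."""
    current = collect_metrics(ask_model)
    if not BASELINE_FILE.exists():
        BASELINE_FILE.write_text(json.dumps(current))
        return []  # first run establishes the baseline
    baseline = json.loads(BASELINE_FILE.read_text())
    alerts = []
    for name, base_value in baseline.items():
        if base_value and abs(current[name] - base_value) / base_value > tolerance:
            alerts.append(f"{name} drifted: baseline {base_value:.2f}, now {current[name]:.2f}")
    return alerts

# Usage, where ask_model is whatever function already wraps the model call:
# for alert in check_drift(ask_model):
#     print(alert)
```

Scheduling a check like this alongside existing application monitoring gives teams an early signal when model behavior shifts, rather than discovering it through broken downstream outputs.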