Dive Brief:
- OpenAI’s GPT-4 Turbo is currently dominating opponents in a chatbot arena that ranks large language models by their performance on a set of multiturn questions and a battery of 57 tasks.
- Users can vote for the better of two large language models by asking any question and identifying a winning response. Once a user chooses the better answer, the platform reveals the selected models. More than 20 LLMs are part of the arena, including Google’s Gemini, Meta’s Llama 2 and the Technology Innovation Institute’s Falcon.
- The top three models on the leaderboard, created by the Large Model Systems Organization, are OpenAI models, with Anthropic’s Claude 1 and Claude 2 rounding out the top five, according to the results, which were last updated Wednesday.
Dive Insight:
While OpenAI kicked off the flood of generative AI tools with the launch of ChatGPT last year, enterprises can now choose among dozens of large language models. It's up to each organization to design an evaluation process that works for it and surfaces the best-fitting model.
“Benchmarking LLM assistants is extremely challenging because the problems can be open-ended, and it is very difficult to write a program to automatically evaluate the response quality,” the Large Model Systems Organization said in a blog post in May. “In this case, we typically have to resort to human evaluation based on pairwise comparison.”
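The arena’s rankings grow out of exactly that kind of pairwise human evaluation: each vote is a head-to-head result that can be folded into a rating. The snippet below is a minimal illustration of the general idea using a simple Elo-style update, not the leaderboard’s actual implementation; the model names, starting rating and K-factor are placeholders.

```python
# Illustrative sketch: turning pairwise human votes into a ranking with a
# simple Elo-style update. Names and constants are placeholders.
from collections import defaultdict

K = 32  # update step size; a common Elo default

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(ratings: dict, winner: str, loser: str) -> None:
    """Apply one pairwise vote: the winner gains rating, the loser gives it up."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - exp_win)
    ratings[loser] -= K * (1 - exp_win)

# Every model starts from the same baseline rating.
ratings = defaultdict(lambda: 1000.0)

# Hypothetical votes: (winning model, losing model) pairs from user comparisons.
votes = [("gpt-4-turbo", "llama-2-70b"), ("claude-2", "falcon-180b"),
         ("gpt-4-turbo", "claude-2")]

for winner, loser in votes:
    update_ratings(ratings, winner, loser)

# Rank models by rating, highest first.
leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
print(leaderboard)
```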
Technology leaders are looking for reliability, performance, security and interoperability with the existing tech stack when evaluating models for implementation. Recently identified changes in the behavior of some of OpenAI’s models present a challenge.
CIOs will need to manage models closely enough to detect when changes occur and to gauge the subsequent impact on operations and end-user experience. But vendors have a role to play in the process, too.
“One thing that vendors could do is provide more checkpoints of their model,” James Zou, assistant professor of biomedical data science at Stanford University, told CIO Dive. “After our previous research came out, OpenAI actually decided to maintain the earlier checkpoints of their model from the initial release in March, because what we found is that there’s a lot of behavior drift and, on some tasks, the March model was doing better than later versions.”
Vendors that maintain earlier checkpoints can provide enterprises with a little extra protection because if the model changes, companies can default to previous versions, Zou said.
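In practice, that fallback can be wired into the application itself. The sketch below shows one way it might look, assuming a hypothetical call_model() wrapper, made-up version strings and a tiny hand-checked regression suite: the newer checkpoint is adopted only if it still clears the team’s own tests, otherwise requests keep going to the pinned version.

```python
# Illustrative sketch of pinning a model checkpoint and falling back when a
# newer version regresses. call_model() is a hypothetical helper standing in
# for whatever client library the provider offers; version strings and the
# regression suite are placeholders.

PINNED_VERSION = "model-2023-03"   # known-good checkpoint the team has validated
LATEST_VERSION = "model-2023-11"   # newer release to be evaluated before adoption

# Small fixed regression suite: prompts with answers the team has signed off on.
REGRESSION_SUITE = [
    ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ("Extract the year from: 'Founded in 1998 in Menlo Park.'", "1998"),
]

def call_model(version: str, prompt: str) -> str:
    """Hypothetical wrapper around the vendor's API for a specific checkpoint."""
    raise NotImplementedError("replace with the provider's client call")

def passes_regression(version: str, threshold: float = 0.9) -> bool:
    """Return True if the checkpoint answers enough of the fixed suite correctly."""
    correct = sum(
        expected.lower() in call_model(version, prompt).lower()
        for prompt, expected in REGRESSION_SUITE
    )
    return correct / len(REGRESSION_SUITE) >= threshold

def choose_version() -> str:
    """Prefer the newer checkpoint, but default to the pinned one on drift."""
    return LATEST_VERSION if passes_regression(LATEST_VERSION) else PINNED_VERSION
```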