Dive Brief:
- AI model performance improved significantly over the past two years, according to the latest AI Index report from the Stanford Institute for Human-Centered AI. The research, education, industry and policy group analyzed 29 benchmarks, evaluations and leaderboards to create the 400-plus page study.
- One benchmark evaluating a model's ability to resolve GitHub issues from popular open-source Python repositories found the best-performing model at the end of 2023 scored 4.4%, while OpenAI’s o3, released to researchers and developers in December, solved nearly 72% of problems by early 2025.
- OpenAI’s o1, introduced in September, took the top spot on a multidiscipline benchmark that evaluates multimodal models on deliberate reasoning and college-level subject knowledge. The o1 model scored 4.4 points below the human benchmark and 18.8 points above last year’s state-of-the-art score.
Dive Insight:
AI models still have room to improve on cost, accessibility and other fronts, even as the performance analysis points to drastic gains.
The Stanford Institute for Human-Centered AI research found energy efficiency has increased by 40% each year, while hardware costs have declined by 30% annually. Models are also becoming smaller and more efficient. Microsoft’s 3.8 billion-parameter Phi-3-mini scored above 60% on a widely used benchmark, a threshold that the smallest model to clear it previously needed 540 billion parameters to reach.
Cost and accessibility have moved to the forefront of criteria enterprises are assessing in model decisions. China-based AI startup DeepSeek captured attention earlier this year when it claimed its R1 model rivaled leading U.S. models at a fraction of the training cost, underlining the enterprise friction with existing cost structures.
Responsible AI is another area where CIOs are taking a closer look. Researchers have created new benchmarks and sounded the alarm on poorly constructed tests, according to Stanford’s analysis.
HELM Safety, which provides a comprehensive evaluation of language models, and AIR-Bench, which focuses on government regulations, are two examples of benchmarks that evaluate models on responsible AI metrics. Anthropic’s Claude 3.5 Sonnet ranked safest on the HELM Safety test, followed closely by OpenAI’s o1.
Analysts have cautioned CIOs against going all-in on one model or vendor, and instead recommended striving for model-agnostic platforms as the pace of innovation persists.
Expedia Group developed an internal experimentation platform with that in mind.
“We really want to make sure we can take advantage of the latest, coolest model,” Shiyi Pickrell, SVP of data and AI at Expedia Group, told CIO Dive. “Some of them have better infrastructure or capabilities, so we built this generic layer allowing us to use different models based on the use case or cost.”
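Expedia has not published implementation details, but the "generic layer" Pickrell describes is a common pattern: a thin routing abstraction that maps use cases to interchangeable model adapters and can fall back on cost. The sketch below is purely illustrative; the `ModelAdapter` and `ModelRouter` names, the per-token costs and the stub models are all hypothetical, and real adapters would wrap provider SDK calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class ModelAdapter:
    """Hypothetical wrapper giving every provider the same call signature."""
    name: str
    cost_per_1k_tokens: float          # illustrative pricing metadata for routing
    generate: Callable[[str], str]     # real adapters would call a provider SDK

class ModelRouter:
    """Routes a request to a registered model by use case, else by cost."""

    def __init__(self) -> None:
        self._models: Dict[str, ModelAdapter] = {}
        self._use_cases: Dict[str, str] = {}   # use case -> model name

    def register(self, adapter: ModelAdapter,
                 use_cases: Tuple[str, ...] = ()) -> None:
        self._models[adapter.name] = adapter
        for use_case in use_cases:
            self._use_cases[use_case] = adapter.name

    def cheapest(self) -> ModelAdapter:
        # Cost-based fallback when no use-case mapping exists.
        return min(self._models.values(), key=lambda m: m.cost_per_1k_tokens)

    def for_use_case(self, use_case: str) -> ModelAdapter:
        name = self._use_cases.get(use_case)
        return self._models[name] if name else self.cheapest()

# Stub adapters stand in for real model endpoints.
router = ModelRouter()
router.register(ModelAdapter("model-a", 0.50, lambda p: f"[model-a] {p}"),
                use_cases=("reasoning",))
router.register(ModelAdapter("model-b", 0.05, lambda p: f"[model-b] {p}"))

print(router.for_use_case("reasoning").name)      # mapped use case -> model-a
print(router.for_use_case("summarization").name)  # unmapped -> cheapest, model-b
```

Because callers depend only on the adapter interface, swapping in a newer or cheaper model is a registration change rather than a rewrite, which is the flexibility the passage describes.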
The overall performance gaps between model competitors have narrowed, too, underlining the need for flexibility.
There was an 11.9% performance gap between the highest and 10th-ranked model in one assessment included in Stanford’s AI Index report last year. The difference shrank to 5.4% this year. Meanwhile, the gap between the top U.S. models and the best Chinese model was 9.26% last year and dwindled to 1.70% in a February assessment.