Dive Brief:
- Anthropic’s Claude 3.5 Sonnet outperformed competitors overall across short, medium and long context windows, according to a Monday report by generative AI platform Galileo. Machine learning engineers tested 22 popular models.
- The Hallucination Index evaluated model accuracy, performance and cost. Google's Gemini 1.5 Flash was one of the best models for cost, Galileo said.
- Among open-source competitors, Alibaba’s Qwen2 model series performed on par with one of Meta’s Llama 3 models in short- and medium-context testing but outperformed it on longer context sets, the report said.
Dive Insight:
Hallucinations, or inaccurate results, present a hurdle for enterprise adoption. CIOs are worried about AI’s precision and how much employees trust the tools and the results they obtain.
The majority of tech leaders are unsure if it is possible to gauge AI output accuracy, according to a Juniper Networks survey.
In the hunt for enterprise customers, vendors have beefed up risk mitigation and security capabilities as they try to stand out in a crowded field.
Anthropic CEO Dario Amodei pitched enterprise and public sectors on the startup’s safety, security and responsibility in Washington, D.C. earlier this summer. Amazon, which has a minority ownership position in Anthropic, has focused on reducing the chance of models producing incorrect results when building out its portfolio as well.
In March, SAP infused its cloud-based data service SAP Datasphere with knowledge graph capabilities meant to inhibit model hallucinations, and Microsoft launched several tools in Azure to increase the trustworthiness of generative AI applications. Google expanded its grounding capabilities to curb hallucinations in July.
Analysts say CIOs should encourage employees to apply skepticism when using these tools and to rely on them only for initial drafts, rather than copying output directly into production.
The accuracy of AI tools is crucial, but enterprises face a number of other barriers as well.
By the end of 2025, at least 3 in 10 generative AI projects will be abandoned after the proof-of-concept stage, Gartner predicts. Poor data quality, inadequate risk controls, unclear business value and escalating costs are driving the failure rate, according to Gartner analysts.