Welo Data has released a groundbreaking research paper, “A Novel Framework for Testing Causal Reasoning in LLMs: Design, Data Collection, and Evaluation,” which introduces a robust multilingual methodology for assessing the causal reasoning capabilities of large language models (LLMs). The study reveals significant gaps in existing AI models’ ability to consistently and accurately process causal relationships, particularly across languages with diverse typological and linguistic features.
Addressing Critical Gaps in AI Reasoning
Causal reasoning—the ability to understand cause-and-effect relationships—is a fundamental step toward achieving artificial general intelligence (AGI). While LLMs demonstrate proficiency in pattern recognition and statistical correlations, they are still in the early stages of mastering causal reasoning. The top-performing model answered only seven out of ten questions correctly, while the average model answered little more than half correctly.
“Our research highlights that LLMs frequently fail at causal reasoning tasks, even in English, and their performance declines significantly in languages such as Turkish and Arabic,” comments Dr. Abigail Thornton, co-author of the study and Research Lab Lead at Welo Data. “This underscores the need for more linguistically diverse training data to improve AI’s ability to understand causality beyond memorized correlations.”
A Multilingual and Structured Evaluation Approach
Existing causal reasoning benchmarks often fall short in both linguistic diversity and task complexity. To address this gap, the Welo Data research team developed a rigorous dataset featuring narrative-based causal reasoning prompts designed to mirror the analytical challenges faced by expert human analysts—those with advanced degrees and at least five years of professional experience. This dataset not only demands nuanced reasoning but also spans six languages: English, Spanish, Japanese, Korean, Turkish, and Standard Arabic. The study evaluated more than 20 LLMs from 10 different developers, assessing their accuracy and consistency in identifying complex causal relationships across diverse linguistic and contextual frameworks.
“We crafted narrative documents from different perspectives of participants involved in fact-based scenarios to test whether models could identify causality reliably,” explains Dr. Fernando Migone, co-author and Vice President, Transformation, at Welo Data. “Our findings show that many models are inconsistent—even when presented with the same logical problem from different viewpoints.”
Key Findings and Implications
- Performance Disparities Across Languages: English and Spanish yielded the highest accuracy, while models struggled significantly with Turkish and Arabic, likely due to linguistic complexity and lower representation in training data.
- Inconsistencies in Model Responses: LLMs frequently provided different answers to identical causal questions, depending on how the prompts were structured.
- Challenges with Chain-of-Thought (CoT) Prompting: While some research suggests that prompting a model to articulate its reasoning step by step (i.e., CoT prompting) can enhance performance, Welo Data’s findings reveal mixed results, pointing to additional areas for future research and study.
These results emphasize the need for improved model training methods, particularly in multilingual and complex reasoning tasks.
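The consistency finding above can be illustrated with a minimal sketch: a model is judged consistent on a causal question only if rephrased variants of that question all yield the same answer. This is not the study’s actual evaluation code; `ask_model` is a hypothetical stand-in (stubbed here with canned responses) for a real LLM API call, and the prompts are invented examples.

```python
# Minimal sketch of a consistency check across prompt framings.
# `ask_model` is a hypothetical stub; a real check would call an LLM API.

def ask_model(prompt: str) -> str:
    # Canned answers for illustration only.
    canned = {
        "From the manager's viewpoint, what caused the delay?": "B",
        "According to the contractor, what caused the delay?": "B",
        "Based on the report, what caused the delay?": "C",
    }
    return canned[prompt]

def is_consistent(prompts: list[str]) -> bool:
    # Consistent only if every framing of the question gets the same answer.
    answers = {ask_model(p) for p in prompts}
    return len(answers) == 1

variants = [
    "From the manager's viewpoint, what caused the delay?",
    "According to the contractor, what caused the delay?",
    "Based on the report, what caused the delay?",
]
print(is_consistent(variants))  # False: the answer changes with framing
```

In this toy run the model flips its answer when the question is framed from a different perspective, which is exactly the inconsistency the study reports for identical causal problems presented from different viewpoints.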
Advancing AI’s Causal Reasoning Capabilities
By establishing a new benchmark for evaluating causal reasoning, Welo Data aims to drive advancements in AI research and development and help developers elevate the performance of their multilingual AI models. The team advocates for further investment in multilingual causal reasoning datasets and refined training methodologies to bridge current gaps.
“The path to AGI requires AI systems that can reason effectively, not just predict patterns,” adds Dr. Thornton. “Our research lays the groundwork for the next stage of AI development—one that prioritizes robust, cross-linguistic reasoning capabilities.”
The full research paper is available here. For more information about Welo Data's Model Assessment Suite and its research, visit welodata.ai.
Welo Data
Welo Data, a division of Welocalize, stands at the forefront of the AI training data industry, delivering exceptional data quality and security. Supported by a global network of over 500,000 AI training professionals and domain experts, along with cutting-edge technological infrastructure, Welo Data fulfills the growing demand for dependable training data across diverse AI applications. Its service offerings span a variety of critical areas, including data annotation and labeling, large language model (LLM) enhancement, data collection and generation, and relevance and intent assessment. Welo Data's technical expertise ensures that datasets are not only accurate but also culturally aligned, tackling significant AI development challenges like minimizing model bias and improving inclusivity. Its NIMO (Network Identity Management and Operations) framework guarantees the highest level of accuracy and quality in AI training data by leveraging advanced workforce assurance methods. welodata.ai
Welocalize, Inc.
Welocalize, a leader in innovative translation and global content solutions, is ranked as one of the world's largest language service providers. Specializing in optimizing customer engagement through localized content, the company has helped some of the world's largest organizations achieve superior business outcomes with multilingual, global content. Central to its approach is OPAL, an AI-enabled platform integrating machine translation, large language models, and natural language processing to automate and enhance translations across over 250 languages. welocalize.com