Editor’s note: The following is a guest post from Noah Pruzek, head of technology services, engineering at Thomson Reuters.
Did you know pumpkins are taxed differently if they’re being used to decorate your house for Halloween, fill your pie at Thanksgiving or flavor your latte? Little details like that, found all over global tax codes and legal systems, are subtleties that can throw AI-powered research tools for a loop if they are not trained on the right datasets.
Programming that level of detail into AI is not as simple as setting a mass-market large language model (LLM) loose on the tax code. It requires a level of data curation and data stewardship that few businesses have even thought about, let alone implemented.
Before an AI-powered tax research tool can pinpoint the right tax for the right product in the right context, it must first be able to ingest a wide assortment of local tax codes, guidance from standards boards, regulatory filings, legal interpretations and more.
Getting that largely unstructured and constantly changing data to a state where it can be codified, processed and updated in real time requires a rigorous data architecture process.
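To make the "constantly changing" part of that problem concrete, one common pattern is to track a content hash for each source so that only new or updated material is re-processed downstream. Here is a minimal sketch in Python; the names (SourceDoc, needs_reprocessing) and the sample document are hypothetical illustrations, not a description of any actual production system:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class SourceDoc:
    source_id: str  # hypothetical stable identifier for a source
    raw_text: str   # latest text fetched for that source

# In-memory index of last-seen content hashes; a real pipeline
# would persist this in a database.
_seen_hashes: dict[str, str] = {}

def needs_reprocessing(doc: SourceDoc) -> bool:
    """Return True if the source is new or its content has changed."""
    digest = hashlib.sha256(doc.raw_text.encode("utf-8")).hexdigest()
    changed = _seen_hashes.get(doc.source_id) != digest
    _seen_hashes[doc.source_id] = digest
    return changed

doc = SourceDoc("example-guidance-101", "Example guidance text ...")
print(needs_reprocessing(doc))  # True: first time this source is seen
print(needs_reprocessing(doc))  # False: content unchanged since last run
```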
Specialization is key
Too often, companies test-driving generative AI solutions that rely on generic LLMs find they can't get the level of nuance and deep domain expertise needed to parse things like context-based taxes on pumpkins, complex legal precedent or hyperlocal compliance requirements.
While publicly available LLMs like GPT-4, Llama and Mistral are remarkable resources for quickly scouring public datasets and returning detailed insights, they don't come out of the box fine-tuned for professional-grade work.
That’s why Gartner recently projected a major evolution away from general-purpose LLMs over the next several years. In fact, the firm predicts that some 50% of the generative AI models enterprises use will be specific to either an industry or a business function by 2027, up from just 1% in 2023.
To truly understand the difference between general-purpose LLMs and more specialized business solutions, it's important to look more closely at the data inputs that make outputs possible, and at the rigorous engineering work that goes into making that data usable.
To produce an accurate search result or summarization of key issues in the U.S. tax code, for example, AI tools need to draw upon thousands of unique sources.
This includes documents from across the court system, the federal tax code, local and hyperlocal tax codes, and analysis and guidance from legal scholars and news coverage, to name a few, all of which change constantly.
And that’s just the data acquisition phase.
To make that data useful, it needs to be integrated, standardized and organized from a ragtag assortment of unstructured PDFs, spreadsheets, policy memos, headnotes, scans, video, audio and countless other formats into a data architecture that can be ingested by an LLM.
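One way to picture that integration step is a common record schema that every source format is mapped into before anything reaches the model. A hedged sketch, assuming a simple normalization layer; the field names and the converter function below are illustrative, not a real production schema:

```python
from dataclasses import dataclass, field

@dataclass
class NormalizedRecord:
    """Target shape every source format is mapped into before ingestion."""
    doc_id: str
    jurisdiction: str    # e.g. a federal, state or city-level code
    doc_type: str        # e.g. "statute", "ruling", "headnote"
    effective_date: str  # ISO 8601 date; tax rules are time-bound
    text: str            # cleaned plain text extracted from the source
    citations: list[str] = field(default_factory=list)

def from_pdf_pages(doc_id: str, jurisdiction: str, doc_type: str,
                   effective_date: str, pages: list[str]) -> NormalizedRecord:
    """Collapse page-by-page PDF extraction into one normalized record;
    parallel converters would exist for spreadsheets, scans, audio, etc."""
    text = "\n".join(page.strip() for page in pages if page.strip())
    return NormalizedRecord(doc_id, jurisdiction, doc_type,
                            effective_date, text)
```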
The human factor
Getting that raw data to a place where it can power a generative AI solution requires two fundamental steps that are not typically part of the development process for mass-market LLMs.
The first is called grounding: the process through which an LLM is augmented with use-case-specific information that is not part of its embedded core knowledge. The most popular technique for doing this is called retrieval-augmented generation, or RAG.
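In concrete terms, RAG means fetching the most relevant curated passages at question time and placing them in the model's prompt, so the answer is grounded in vetted sources rather than in the model's pretraining data alone. A minimal sketch follows; the keyword-overlap retriever stands in for the embedding-based vector search a real system would use, and the tiny corpus and its contents are made up purely for illustration:

```python
def retrieve(query: str, corpus: dict[str, str], k: int = 3) -> list[str]:
    """Toy retriever: rank passages by keyword overlap with the query.
    A production system would use embedding similarity over a vector index."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus.items(),
        key=lambda item: len(q_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]

def build_grounded_prompt(query: str, corpus: dict[str, str]) -> str:
    """Prepend retrieved passages so the model answers from curated
    sources rather than relying only on what it memorized in training."""
    context = "\n---\n".join(retrieve(query, corpus))
    return (
        "Answer using only the context below and cite the passage used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# Made-up passages for illustration only, not real tax guidance.
corpus = {
    "passage-1": "Pumpkins sold as food for human consumption are exempt.",
    "passage-2": "Pumpkins sold for decoration are taxable at the general rate.",
}
print(build_grounded_prompt("Is a decorative pumpkin taxable?", corpus))
```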
The grounding step is akin to the knowledge leap a person makes in going from earning a college degree to graduating from law school. Developers must have all of the specialized data, and the domain expertise to organize it in such a way that the eventual outputs will be useful to end-user professionals.
Then comes the second critical component of developing professional-grade LLMs: human experts. No amount of computing power or innovation will ever replace human subject matter experts, who are central to creating industry-specific LLMs.
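What that looks like in practice varies, but one common pattern is an expert-in-the-loop review queue: low-confidence answers, plus a random sample of the rest, are routed to subject matter experts whose corrections flow back into the curated dataset. A hypothetical sketch, not a description of any particular product:

```python
import random

REVIEW_SAMPLE_RATE = 0.10  # hypothetical: experts audit 10% of confident answers

def route_answer(answer: str, confidence: float,
                 threshold: float = 0.8) -> tuple[str, str]:
    """Route low-confidence answers, plus a random sample of the rest,
    to a subject-matter-expert review queue for correction."""
    if confidence < threshold or random.random() < REVIEW_SAMPLE_RATE:
        return ("sme_review_queue", answer)
    return ("publish", answer)

print(route_answer("A confident draft answer.", 0.95))  # usually publishes
print(route_answer("An uncertain draft answer.", 0.55))  # always reviewed
```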
The future of generative AI
One of the first major headlines that launched the generative AI revolution two years ago was ChatGPT passing the bar exam. Ever since, there has been a somewhat oversimplified view of generative AI as an all-powerful technology capable of disintermediating human expertise.
In fact, the underlying technology that made it possible for AI to excel in a structured environment like a standardized test was a massive breakthrough, one that will change the way we think about technology forever. But it was just the foundation.
Before generative AI can be reliably trusted with unstructured professional tasks, the underlying data that goes into models, and the manner in which that data is curated and engineered, must be right. That work will define the future of AI.
Developers who achieve the highest degree of specialization in their model development processes will be the ones who set the standard for professional-grade AI.