Researchers from MBZUAI have developed a new approach for automatically fact-checking outputs from large language models (LLMs) that has the potential to significantly reduce the cost of this important activity. The team’s findings were presented at the recent 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL).
LLMs have the ability to draw on the huge body of data that was used to train them and can quickly produce answers to nearly any question users ask. Often these answers are correct, but there are times when they aren’t. Due to the way they are designed, LLMs have a tendency to make up answers that sound confident but aren’t factual. This phenomenon, called hallucination, limits the utility of today’s LLMs.
Scientists have developed frameworks to automatically fact-check text generated by LLMs. These approaches typically break up outputs from models into chunks that relate to specific claims and conduct web searches to retrieve information from online sources that either support or refute these claims. These systems then use another language model to process information retrieved from the web and make a judgment about the truth of each claim.
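To make that pipeline concrete, here is a minimal sketch in Python of the decompose–retrieve–verify approach described above. The helper functions `call_llm` and `web_search`, and the prompts, are hypothetical placeholders standing in for whatever LLM and search APIs a given framework uses; they are not the interfaces from the study.

```python
# Hypothetical placeholders for an LLM API and a paid web-search API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def web_search(query: str, top_k: int = 5) -> list[str]:
    raise NotImplementedError("plug in a search client here")

def baseline_fact_check(text: str) -> list[dict]:
    """Split text into claims, retrieve web evidence for every claim,
    and ask a verifier model for a supported/refuted verdict."""
    claims = call_llm(
        "Split the following text into atomic factual claims, one per line:\n" + text
    ).splitlines()
    results = []
    for claim in claims:
        evidence = web_search(claim)  # one or more paid searches per claim
        verdict = call_llm(           # plus another paid LLM call per claim
            f"Claim: {claim}\nEvidence: {evidence}\n"
            "Is the claim supported or refuted by the evidence?"
        )
        results.append({"claim": claim, "verdict": verdict})
    return results
```

Because every claim triggers at least one search and one verification call, the cost grows with the number of claims, which is the expense described next.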
This approach works well, but it is expensive. Each web search, and each analysis of the retrieved information, comes with a cost. Another drawback is that this method doesn’t take advantage of the full internal knowledge of the LLM. Even though models hallucinate and generate inaccurate outputs, they may still know the correct answer. Indeed, it’s believed that the largest LLMs, from companies like Anthropic, Meta, and OpenAI, have been trained on all the publicly available information on the internet.
“The evidence retrieval process might not be necessary in some cases,” says Zhuohan Xie, a postdoctoral researcher at MBZUAI and lead author of the study presented at NAACL.
Xie and his team’s new method, called FIRE (fact-checking with iterative retrieval and verification), assesses how confident a model is about each claim before searching for more information. If the model’s confidence is above a certain threshold, FIRE accepts the model’s own verdict without searching. If its confidence falls below the threshold, FIRE searches the web for more information and classifies the claim as true or false based on what it finds.
What’s more, when FIRE resorts to searching the web, it stores the knowledge it gains from these searches to aid in determining the classification of other claims from the same piece of text. “We’ve tried to develop an iterative process that is similar to the way a human fact-checker would verify a claim,” Xie says.
Rui Xing, Yuxia Wang, Jiahui Geng, Hasan Iqbal, Dhruv Sahnan, Iryna Gurevych, and Preslav Nakov, of MBZUAI, contributed to the study.
Because FIRE relies on the internal knowledge of the LLM to determine the factuality of claims, some searches don’t need to be made. “Many claims are simple enough that they don’t require searches,” Xie says. “Compared to other approaches, our framework is more dynamic, more scalable, and can save on costs.” When a claim does require searching, the system keeps iterating until it reaches a maximum number of steps, at which point it gives an answer based on the information it has gathered.
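Putting the pieces from the preceding paragraphs together, the following is a simplified sketch, under assumptions, of a confidence-gated loop in the spirit of FIRE: the verifier answers from its own knowledge when its confidence clears a threshold, searches the web otherwise, reuses retrieved evidence for later claims from the same text, and stops after a maximum number of steps. The prompts, the confidence score, the threshold value, and the helper functions are illustrative assumptions, not the authors’ implementation.

```python
# Hypothetical placeholders; not the study's actual interfaces.
def call_llm_with_confidence(prompt: str) -> tuple[str, float]:
    """Return a true/false verdict plus the model's confidence in it (0-1)."""
    raise NotImplementedError("plug in an LLM client here")

def web_search(query: str) -> list[str]:
    raise NotImplementedError("plug in a search client here")

def fire_style_check(claims: list[str], threshold: float = 0.9,
                     max_steps: int = 3) -> dict[str, str]:
    evidence_store: list[str] = []  # evidence gathered so far, shared across claims
    verdicts: dict[str, str] = {}
    for claim in claims:
        verdict, confidence = call_llm_with_confidence(
            f"Claim: {claim}\nKnown evidence: {evidence_store}\n"
            "Is this claim true or false, and how confident are you?"
        )
        steps = 0
        # Search the web only when the model's own confidence is below the threshold.
        while confidence < threshold and steps < max_steps:
            evidence_store.extend(web_search(claim))
            verdict, confidence = call_llm_with_confidence(
                f"Claim: {claim}\nKnown evidence: {evidence_store}\n"
                "Is this claim true or false, and how confident are you?"
            )
            steps += 1
        # After the step cap, keep the best available verdict regardless of confidence.
        verdicts[claim] = verdict
    return verdicts
```

Claims the model is already confident about cost a single LLM call and no searches, which is where the savings over the always-retrieve pipeline come from.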
The researchers tested their system on four benchmark datasets: FacTool, FELM, Factcheck-Bench, and BingCheck. They processed the datasets into a common format so that results could be compared across them.
They compared the performance of FIRE and other fact-checking frameworks on these benchmarks with reasoning and non-reasoning LLMs. The LLMs they tested included systems from OpenAI (GPT-4o, GPT-4o-mini, o1-preview, and o1-mini), Anthropic (Claude-3 Haiku, Claude-3 Opus, Claude-3.5 Sonnet), Meta (LLaMA 3.1-Instruct 8B), and Mistral (Mistral-Instruct 7B).
In these tests the researchers were interested in identifying tradeoffs between computational cost and fact-checking performance.
They found that FIRE and other fact-checking frameworks showed similar performance. But using FIRE with GPT-4o-mini reduced LLM costs by an average factor of 7.6 and search costs by a factor of 16.5. They also found that while FIRE with OpenAI’s GPT-4o-mini didn’t perform quite as well as FIRE with a more advanced OpenAI model, o1-preview, it was 766 times cheaper. The researchers write that this huge difference in cost suggests that the most advanced models may not be necessary for fact-checking tasks.
The team found that non-reasoning models were cheaper to run but required more web searches — which have their own costs. And though reasoning models were more expensive, they performed better than non-reasoning models.
Interestingly, by having humans evaluate how the systems made mistakes, Xie and his coauthors discovered errors in the benchmark datasets. Some of the claims in the datasets were ambiguous or subjective. In other cases, the true-false label was simply wrong.
Xie explains that it doesn’t make sense to boost the performance of models on datasets where some of the data is wrong, as this would result in “overfitting” the model to the dataset but not help its performance in the real world.
In addition to remedying hallucinations by LLMs, fact-checking systems like FIRE can be used to identify and combat mis- and disinformation on the internet, which is found not only in text but also in images and video. Though FIRE in its current form is designed specifically for text, it could be expanded for fact-checking multimodal information as well, Xie says.
He adds that a future iteration of FIRE has the potential to serve as a kind of additional source of knowledge to complement LLMs. Knowledge from previous searches by FIRE could be used to inform subsequent searches by other users, which would further reduce the cost of fact-checking.
While there is still work to be done to develop better and cheaper fact-checking methods, Xie is amazed at the rapid advancement of these systems and their potential for identifying hallucinations and mis- and disinformation. “There are always exciting things happening” in the field of natural language processing, he says. And while the current benchmark datasets aren’t perfect, “we will continue to compare the performance of models on benchmarks to see how these models are progressing.”