New resources for fact-checking LLMs presented at EMNLP

Thursday, December 05, 2024

At the recent Empirical Methods in Natural Language Processing (EMNLP) conference held in Miami, researchers from Mohamed bin Zayed University of Artificial Intelligence shared new findings related to improving the factuality of large language models (LLMs). The work also features a web application where users can fact-check text generated by LLMs, test the factuality of LLMs across several benchmarks and examine the performance of automated fact-checkers.

Today’s most popular LLMs, like OpenAI’s GPT and Meta’s LLaMA, can be extremely helpful for users. That said, researchers have found that approximately 10 percent of the claims these systems make are false. An added challenge is that these falsehoods are difficult for users to spot, because they are presented just as confidently as the accurate information that surrounds them in LLM outputs.

Since the launch of ChatGPT at the end of 2022, developers have built a variety of automated fact-checking systems. These tools, however, aren’t always accurate, explained Yuxia Wang, a postdoctoral researcher at MBZUAI and co-author of the studies presented at EMNLP. Indeed, in one of the papers presented at the conference, Wang and her colleagues found that some of today’s best automated fact-checkers still miss nearly 40 percent of false claims generated by LLMs.

Challenges of fact-checking LLMs

Fact-checking is no easy task. There are several steps to the process, and it’s important to understand at which step a system may go awry. And while fact-checking systems have often been measured according to their final output, this method is of limited value as it doesn’t provide details about where in the process a system fails, Wang explains. “This motivated us to create a fine-grained benchmark to analyze each step so that developers can evaluate their systems and identify which steps are weak and which are strong so that the system can be improved,” she says.

In their study, Wang and her colleagues describe a series of eight tasks that automated fact-checkers follow to identify false claims and correct them. These include decomposition, where a system must break down a response generated by an LLM into “context-independent atomic statements.” For example, take the LLM output: “Elon Musk bought Twitter in 2020 and renamed it to X.” There are three atomic claims in the sentence: Musk bought Twitter; he bought it in 2020; and he renamed it to X.

Each of these claims must then be verified independently. Before that can happen, however, the claims must be decontextualized, another step in the framework, so that each one can be understood without the surrounding text. In the example above, this means verifying separately that Musk bought Twitter, that the purchase happened in 2020, and that he renamed the platform to X.

As Wang and her colleagues explain in the paper, the decontextualization step first splits an LLM output into sentences and then breaks sentences into claims, “with each claim containing only one property or fact or verdict.”
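To make this step concrete, here is a minimal sketch of prompt-based decomposition and decontextualization. The `call_llm` helper is a hypothetical stand-in for whatever chat-completion client is available, and the prompt and output format are illustrative assumptions, not the paper’s exact implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call; wire this to your own client."""
    raise NotImplementedError

# Illustrative prompt: ask the model for atomic, self-contained claims.
DECOMPOSE_PROMPT = (
    "Split the text into context-independent atomic claims, one per line. "
    "Resolve pronouns so that each claim can be understood on its own.\n\n"
    "Text: {text}\nClaims:"
)

def decompose(text: str) -> list[str]:
    """Break an LLM response into atomic, decontextualized claims."""
    raw = call_llm(DECOMPOSE_PROMPT.format(text=text))
    return [line.lstrip("- ").strip() for line in raw.splitlines() if line.strip()]

# Expected behaviour on the example above:
# decompose("Elon Musk bought Twitter in 2020 and renamed it to X.")
# -> ["Elon Musk bought Twitter.",
#     "Elon Musk bought Twitter in 2020.",
#     "Elon Musk renamed Twitter to X."]
```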

The other steps in the fact-checking pipeline are: checkworthiness identification; evidence retrieval and collection; stance detection; correction determination; claim correction; and final response revision.
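Taken together, the eight tasks form a pipeline in which the output of one step feeds the next. The skeleton below shows one way data could flow through the stages; the function names mirror the tasks listed above, while the signatures and the `Claim` record are assumptions made for illustration rather than the benchmark’s actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    evidence: list[str] = field(default_factory=list)
    stance: str = "not enough info"   # "support", "refute" or "not enough info"
    corrected: str | None = None

# Stubs standing in for the eight tasks; in a real system each would be backed
# by prompts, retrieval calls or classifiers.
def decompose(response: str) -> list[str]: ...                        # 1. decomposition
def decontextualize(claims: list[str]) -> list[str]: ...              # 2. decontextualization
def is_checkworthy(claim: str) -> bool: ...                           # 3. checkworthiness identification
def retrieve_evidence(claim: str) -> list[str]: ...                   # 4. evidence retrieval and collection
def detect_stance(claim: str, evidence: list[str]) -> str: ...        # 5. stance detection
def needs_correction(stance: str) -> bool: ...                        # 6. correction determination
def correct_claim(claim: str, evidence: list[str]) -> str: ...        # 7. claim correction
def revise_response(response: str, claims: list[Claim]) -> str: ...   # 8. final response revision

def fact_check(response: str) -> str:
    claims = [Claim(c) for c in decontextualize(decompose(response)) if is_checkworthy(c)]
    for c in claims:
        c.evidence = retrieve_evidence(c.text)
        c.stance = detect_stance(c.text, c.evidence)
        if needs_correction(c.stance):
            c.corrected = correct_claim(c.text, c.evidence)
    return revise_response(response, claims)
```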

In addition to the framework, Wang and her team also developed a benchmark called Factcheck-Bench designed to evaluate the performance of automatic fact-checkers. The benchmark is composed of nearly 700 claims made by LLMs in English across different areas of knowledge and annotated by people.

Evaluating LLMs and fact-checkers

Factcheck-Bench informed the development of another framework called OpenFactCheck that is designed to evaluate both LLMs and automated fact-checkers. The authors describe OpenFactCheck as a unified framework composed of three modules.

The first module, ResponseEvaluator, is an automatic fact-checker built into a web application that users can customize to verify claims produced by LLMs. It consolidates a process that is often handled by separate systems, explains Hasan Iqbal, a master’s student at MBZUAI and co-author of the study. ResponseEvaluator includes a claim processor that breaks down a document into individual claims, a retriever that gathers evidence from the web and a verifier that compares a claim to the evidence gathered.
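One rough way to picture how these pieces fit together is a small evaluator whose claim processor, retriever and verifier are pluggable callables. The class name and interfaces below are assumptions for illustration only, not OpenFactCheck’s actual API.

```python
from typing import Callable

class ResponseEvaluatorSketch:
    """Toy composition of the three components described above."""

    def __init__(
        self,
        claim_processor: Callable[[str], list[str]],   # document -> atomic claims
        retriever: Callable[[str], list[str]],         # claim -> evidence snippets from the web
        verifier: Callable[[str, list[str]], bool],    # (claim, evidence) -> supported or not
    ):
        self.claim_processor = claim_processor
        self.retriever = retriever
        self.verifier = verifier

    def evaluate(self, document: str) -> dict[str, bool]:
        """Return a verdict for every claim extracted from the document."""
        return {
            claim: self.verifier(claim, self.retriever(claim))
            for claim in self.claim_processor(document)
        }
```

Keeping the components swappable is what makes this kind of checker customizable, for instance by pointing the retriever at a different search backend.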

Iqbal notes that developing ResponseEvaluator brought to light obstacles inherent in automated fact-checking. One is geographical variation in retrieved information. “Depending on where you use the system, you may get different information through a web search,” he says. “There are also claims that are correct in some regions and not others.”

Another interesting challenge, Iqbal notes, relates to how facts can change over time. For instance, a person could have had a long career at one organization and move on to another, with the result being that most information on the web relates to their previous position. “In these cases, the system will give us wrong verification results,” he says.

The second module, LLMEvaluator, assesses the factuality of LLMs across several benchmark datasets with the goal of identifying a model’s strengths. Developers can use the OpenFactCheck web app to test their models on these datasets and receive reports on their performance.
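In spirit, that kind of evaluation amounts to generating a response for every benchmark prompt, fact-checking it, and aggregating the results per dataset. The helper below is a generic sketch under that assumption; the data format and function names are illustrative, not the module’s real interface.

```python
def evaluate_llm(generate, check_factual, benchmarks: dict[str, list[str]]) -> dict[str, float]:
    """For each dataset, report the share of responses judged factual.

    generate:      prompt -> model response
    check_factual: response -> True if the response is judged factual
    benchmarks:    dataset name -> list of prompts
    """
    report = {}
    for name, prompts in benchmarks.items():
        verdicts = [check_factual(generate(p)) for p in prompts]
        report[name] = sum(verdicts) / len(verdicts)
    return report

# e.g. evaluate_llm(my_model, my_checker, {"dataset_a": [...], "dataset_b": [...]})
```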

The third module, CheckerEvaluator, is a fact-checker evaluator and leaderboard that is designed to encourage the development of new systems. When developers build fact-checking tools, they often do so according to different strategies and with different priorities in mind, Wang explains. Some run fast but are expensive. Others are based on local databases, while others query remote databases. All these considerations must be balanced when building these systems, she says, and there is clearly a need for new systems with improved performance.
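Scoring a fact-checker against a human-annotated benchmark such as Factcheck-Bench boils down to comparing its verdicts with the gold labels. The sketch below shows one such comparison, assuming a simple True-means-factual label convention; the metric names are illustrative rather than the leaderboard’s official ones.

```python
def score_checker(predictions: list[bool], gold: list[bool]) -> dict[str, float]:
    """predictions[i] and gold[i] are verdicts for the same claim (True = factual)."""
    false_claims = [i for i, g in enumerate(gold) if not g]
    caught = sum(1 for i in false_claims if not predictions[i])
    return {
        "accuracy": sum(p == g for p, g in zip(predictions, gold)) / len(gold),
        # Share of genuinely false claims the checker flags as false; a checker
        # that misses nearly 40 percent of false claims would score about 0.6 here.
        "false_claim_recall": caught / len(false_claims) if false_claims else 1.0,
    }
```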

OpenFactCheck is fully open source and the code is available as a Python library. “We hope that people will install OpenFactCheck and leverage these tools to enhance the accuracy of their systems,” Iqbal says. “As an open-source project, we encourage the community to contribute, collaborate and help make OpenFactCheck even more impactful.”
