Truth from uncertainty: using AI’s internal signals to spot hallucinations

Thursday, December 04, 2025

When language models aren’t sure what to say, they tend to make things up. This is a problem in itself. What makes it worse is that models sound confident even when they fabricate information, making these hallucinations difficult for users to spot.

Researchers have developed various methods to address hallucination, including checking outputs against external knowledge sources or cross-checking them with other systems. But current approaches have limitations and can be computationally expensive.

A team of researchers from MBZUAI and other institutions has taken a different approach that holds the potential to make hallucination identification more accurate and efficient.

Their framework is grounded in the idea that language models inherently encode signals about the confidence of their answers in their internal states, and that this “self-knowledge” can be extracted to identify hallucinations. The approach builds on a concept known as uncertainty quantification.

The researchers built and trained what they call uncertainty quantification heads (UQ heads). These are small, efficient modules that probe the internal states of a language model and provide credibility estimates for generated text fragments, marking potential hallucinations. The modules don’t alter the model’s generation process; they rely solely on the model’s internal signals and require no external knowledge. They can also be used in a fully plug-and-play manner.
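To make the plug-and-play idea concrete, here is a minimal sketch of the workflow, assuming per-fragment feature vectors have already been extracted from the frozen model. The function name, the toy stand-in head, and the threshold are illustrative only, not the authors’ released tooling.

```python
import torch

def flag_unreliable_fragments(fragment_features, uq_head, threshold=0.5):
    """fragment_features: (num_fragments, feature_dim) vectors built from the
    frozen model's attention maps and logits; generation itself is untouched."""
    with torch.no_grad():
        scores = uq_head(fragment_features)   # higher = more likely hallucinated
    return (scores >= threshold).nonzero(as_tuple=True)[0].tolist()

# Toy stand-in head and three generated fragments with 16-dimensional features.
toy_head = torch.nn.Sequential(
    torch.nn.Linear(16, 1), torch.nn.Flatten(0), torch.nn.Sigmoid())
print(flag_unreliable_fragments(torch.randn(3, 16), toy_head))
```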

The team found that UQ heads achieved state-of-the-art performance in claim-level hallucination detection, and that they worked well on both in-domain and out-of-domain prompts. They even generalized to languages they were not trained on.

The team’s findings were recently presented at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) held in Suzhou, China.

The authors of the study are Artem Shelmanov, Ekaterina Fadeeva, Akim Tsvigun, Ivan Tsvigun, Zhouhan Xie, Igor Kiselev, Nico Daheim, Caiqi Zhang, Artem Vazhentsev, Mrinmaya Sachan, Preslav Nakov, and Timothy Baldwin.

From patterns in attention to hallucination detection

UQ heads leverage two key sources of information from language models: attention maps and logits.

Attention maps show where a model is “looking” when generating each word, whether it’s focused on the prompt or on its own output. Paying too much attention to the output can be a hallmark of hallucination.
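As a rough illustration of this signal (not the paper’s exact feature set), one could measure how much of a generated token’s attention mass lands on the prompt versus the model’s own output:

```python
import torch

def prompt_attention_share(attn_row: torch.Tensor, prompt_len: int) -> float:
    """attn_row: attention weights from the current token to all previous
    positions, shape (seq_len,), normalized to sum to 1."""
    return attn_row[:prompt_len].sum().item()

# Toy example: 4 prompt tokens followed by 3 generated tokens.
attn_row = torch.tensor([0.05, 0.05, 0.10, 0.10, 0.30, 0.25, 0.15])
print(prompt_attention_share(attn_row, prompt_len=4))  # ~0.30: mostly self-focused
```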

Logits are the raw scores a model assigns to candidate words, and they reflect how confident it is in its word choices. When the gap between the highest score and the next highest is large, the model is confident. When the gap is small, the model is less certain and may be hallucinating.
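That gap can be computed directly from the top two logits at each generation step; the sketch below uses made-up numbers to show the contrast between a confident and an uncertain prediction.

```python
import torch

def top2_logit_margin(logits: torch.Tensor) -> float:
    """logits: unnormalized scores over the vocabulary at one generation step."""
    top2 = torch.topk(logits, k=2).values
    return (top2[0] - top2[1]).item()

confident = torch.tensor([9.0, 2.0, 1.5, 0.3])   # one clear winner
uncertain = torch.tensor([3.1, 3.0, 2.9, 2.8])   # near-tie between candidates
print(top2_logit_margin(confident))   # 7.0  -> confident choice
print(top2_logit_margin(uncertain))   # ~0.1 -> low confidence, possible hallucination
```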

Shelmanov, a senior research scientist at MBZUAI and lead author of the study, says that he and his colleagues had worked on this problem for a couple of years and noticed that attention weights and logits provided a strong signal for hallucination detection. “Our intuition was that attention-based signals could generalize better than those derived from hidden states,” he says. “When using hidden states, a hallucination detector tends to memorize which topics it handles well and which it doesn’t, instead of learning a truly generalizable notion of confidence.”

Another strength of their approach compared with other methods is that the modules themselves are built on the transformer architecture. “We chose the transformer because of its expressive power, flexibility, and proven generality across a wide range of natural language processing tasks,” Shelmanov says.
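A minimal sketch of what such a transformer-based head might look like, assuming per-token feature vectors derived from the frozen model’s attention weights and logits; the dimensions and layer counts below are illustrative, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class TransformerUQHead(nn.Module):
    """Maps per-token feature vectors (derived from the frozen LLM's attention
    weights and logits) to per-token hallucination scores in [0, 1]."""

    def __init__(self, feature_dim=64, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(feature_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, token_features):
        # token_features: (batch, seq_len, feature_dim)
        x = self.encoder(self.proj(token_features))
        return torch.sigmoid(self.classifier(x)).squeeze(-1)

head = TransformerUQHead()
scores = head(torch.randn(1, 12, 64))   # e.g. one response of 12 generated tokens
print(scores.shape)                     # torch.Size([1, 12])
```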

They trained the modules with an automatic pipeline in which another language model acts as a “judge” to generate annotated training data. In this setup, a language model, or several of them, is used to annotate text, identifying hallucinations with high precision, Shelmanov explains. The uncertainty quantification head is then trained on this annotated data to predict hallucinations from the attention weights and logits of the target language model.
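A rough sketch of that training step, under the description above: claim-level judge annotations are stood in for by placeholder labels, and the head is fit to features extracted from the target model’s internals. The shapes, toy head, and hyperparameters are assumptions for illustration, not the authors’ pipeline.

```python
import torch
import torch.nn as nn

def train_uq_head(head, features, judge_labels, epochs=3, lr=1e-3):
    """features: (num_texts, seq_len, feature_dim) from the frozen target model;
    judge_labels: (num_texts, seq_len), 1 = claim judged hallucinated, 0 = supported."""
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        scores = head(features).squeeze(-1)          # per-token hallucination scores
        loss = loss_fn(scores, judge_labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return head

# Toy run: 8 annotated generations, 12 tokens each, 64-dim features; a tiny
# stand-in head (the transformer head sketched above would slot in here).
toy_head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
train_uq_head(toy_head, torch.randn(8, 12, 64), torch.randint(0, 2, (8, 12)))
```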

The automatic pipeline is a cost-effective way to build large datasets of outputs from a variety of language models, annotated with claim-level hallucinations, the researchers explain.

Strong performance across domains and languages

The team’s approach demonstrated significant improvements over existing methods.

They compared their UQ heads to other methods and found that they surpassed the next-best-performing system by 5% on an in-domain evaluation. On out-of-domain measurements, the UQ heads outperformed other supervised methods on all domains except landmarks, where they were slightly below the closest competitor.

The researchers also found that the UQ heads displayed strong cross-lingual generalization. They performed 13% better in German, 10% better in Chinese, and 9% better in Russian than other systems, even though they were trained only on English.

“We were quite surprised that it was able to generalize not only to other domains, but also to other languages,” Shelmanov says. “This means that you can train your hallucination detector on people’s biographies in English and apply it to German or Chinese versions of those biographies.”

Shelmanov says this cross-lingual generalization may be because the internal patterns associated with hallucination are similar across languages.

Real-world applications

Uncertainty quantification is important for many applications and could be integrated into today’s language models to visually “highlight text fragments that are likely unreliable, helping users recognize and cross-check potentially hallucinated content,” Shelmanov explains.
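A toy sketch of what that highlighting could look like once per-claim scores are available; the claims, scores, and threshold below are invented for illustration.

```python
def highlight_unreliable(claims, scores, threshold=0.5):
    """Wrap claims whose hallucination score crosses the threshold in a marker."""
    return " ".join(f"⚠[{c}]" if s >= threshold else c
                    for c, s in zip(claims, scores))

claims = ["Einstein was born in Ulm.",          # correct
          "He won the Nobel Prize in 1925."]    # plausible-sounding error (it was 1921)
scores = [0.08, 0.81]                           # hypothetical UQ-head outputs
print(highlight_unreliable(claims, scores))
```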

Uncertainty quantification is also complementary to other methods and could be used in combination with them.

It’s extremely efficient, too. “We can replace massive reward models with compact uncertainty quantification heads that contain only 10 to 20 million parameters, achieving comparable performance with significant computational savings,” he says.

Shelmanov says that the implications of uncertainty quantification go beyond hallucination detection. For example, it could be used in test-time scaling algorithms to improve the reasoning performance of language models. “With our techniques, we can boost model accuracy on complex reasoning tasks related to mathematics or multi-step planning, all without increasing model size.”
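One hedged example of how such a signal might plug into test-time scaling is a simple best-of-N selection: sample several candidate answers and keep the one the uncertainty estimate trusts most. The stand-in generator and scorer below are purely illustrative, not the authors’ algorithm.

```python
import random

def best_of_n(generate, uncertainty, prompt, n=4):
    """Sample n candidate answers and keep the one the UQ signal trusts most."""
    candidates = [generate(prompt) for _ in range(n)]
    return min(candidates, key=uncertainty)   # lower uncertainty = more trusted

# Toy stand-ins for a model call and a UQ-head score.
answers = ["42", "41", "44"]
gen = lambda prompt: random.choice(answers)
uq = lambda text: {"42": 0.1, "41": 0.7, "44": 0.9}[text]
print(best_of_n(gen, uq, "What is 6 x 7?"))   # usually "42"
```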
