
When disagreement becomes a signal for AI models

Tuesday, March 24, 2026

AI models are usually trained on a basic assumption: for every example, there is one right label. A sentence is toxic or it is not, a claim is entailed or contradicted, and a legal problem belongs in one category and not another.

A paper called Training and Evaluating with Human Label Variation: An Empirical Study, coauthored by researchers at The University of Melbourne and MBZUAI and published in the journal Computational Linguistics, starts from a different premise: human annotators often disagree, and sometimes they disagree for good reasons. In many tasks, especially those involving language, ambiguity is not noise to be cleaned away, but part of the phenomenon being measured.

A lot of what we think of as modern AI depends on human judgment. Content moderation, natural language inference, moral reasoning, legal triage, and sentiment analysis all rely on labels produced by people whose interpretations can differ. One annotator may see sarcasm in a post, while another sees hostility. One lawyer may classify a problem under one area of law while another, equally qualified, may emphasize a different one.

Standard machine learning practice tends to flatten those variations into a single answer, often through majority vote, and useful information disappears in the process. To address this loss, the paper uses a concept called human label variation, or HLV.

HLV treats these disagreements not as flaws in the data, but as signals. But once you stop pretending there is always one clean ground truth, both training and evaluation become messier. What should a model learn from a split set of human judgments, and how should its output be scored?

A new approach to modeling human judgment

That is the question the paper takes on. The researchers propose new evaluation metrics designed for data with label variation, borrowing ideas from fuzzy set theory and, less obviously, from remote sensing. In satellite image classification, a patch of land may contain water, vegetation, and roads at once, so researchers have long had to think in degrees rather than absolutes. The authors bring a similar logic into natural language processing, representing human judgments as partial memberships rather than single fixed labels.

This representation pays off both mathematically and intuitively. Traditional information theory measures can compare distributions, but they are not always easy to interpret outside a small circle of specialists. The new “soft” metrics proposed in the paper aim to preserve the feel of familiar measures like accuracy and F-score while adapting them to cases where multiple labels may plausibly apply. Instead of asking whether the model got the one correct answer, the metric asks how well the model’s output aligns with the distribution of human judgments. One of the paper’s findings is that a Jensen-Shannon divergence-based score, an existing approach in this area, can produce misleadingly high results. The paper argues that its proposed soft micro F1 is among the strongest metrics for this setting, and recommends that future HLV work report it.
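To make the idea concrete, here is a minimal sketch of a fuzzy-set-style soft micro F1. The paper's exact formulation is not reproduced in this article, so this assumes the standard fuzzy-set convention: elementwise minimum as the "soft" true-positive count, with sums of membership degrees replacing crisp positive counts. The label degrees are hypothetical.

```python
def soft_micro_f1(pred, human):
    """Soft micro F1 over per-label membership degrees in [0, 1].

    Fuzzy-set intersection (elementwise min) replaces the crisp
    true-positive count; sums of degrees replace crisp counts.
    """
    tp = sum(min(p, h) for p, h in zip(pred, human))
    precision = tp / sum(pred) if sum(pred) else 0.0
    recall = tp / sum(human) if sum(human) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: three of five annotators marked a post
# "toxic", so its human membership degree is 0.6.
human = [0.6, 0.4]   # [toxic, not-toxic]
model = [0.5, 0.5]   # model's predicted distribution
print(soft_micro_f1(model, human))  # rewards overlap, not exact match
```

A crisp metric would score the model's 0.5/0.5 output as simply wrong against a majority-vote label; the soft version credits it for being close to the split human judgment.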

Because their soft metrics are differentiable, the researchers also tested whether the metrics could serve directly as training objectives. If a metric captures what you truly care about, why not optimize for it directly? Across six datasets, fourteen training methods, and two model scales, the best-performing approaches were usually the simpler ones. Training on disaggregated annotations directly, or training on soft labels that preserve the distribution of judgments, generally worked better across metrics than the more elaborate objectives based on the newly proposed differentiable metrics.
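The two simple approaches that performed well can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the label set, votes, and model output below are hypothetical, and the soft-label objective is the usual cross-entropy against the normalized vote distribution rather than against a majority-vote one-hot.

```python
import math
from collections import Counter

def soft_label(annotations, labels):
    """Normalize raw annotator votes into a probability distribution."""
    counts = Counter(annotations)
    n = len(annotations)
    return [counts[label] / n for label in labels]

def cross_entropy(target, predicted, eps=1e-12):
    """Soft-label cross-entropy: match the whole distribution,
    not just the majority winner."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

labels = ["entailed", "neutral", "contradicted"]
votes = ["entailed", "entailed", "neutral"]   # three annotators disagree
target = soft_label(votes, labels)            # [0.667, 0.333, 0.0]

# Majority-vote training would use the one-hot [1, 0, 0] and discard
# the minority judgment; the soft target keeps it as signal.
model_out = [0.6, 0.3, 0.1]
loss = cross_entropy(target, model_out)
```

Training on disaggregated annotations is even simpler: each annotator's vote becomes its own training example, so the same input appears multiple times with different hard labels, and the distribution is preserved implicitly.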

The experiments cover binary, multiclass, and multilabel classification, include both crowd and expert annotators, and span English and Arabic. The most distinctive dataset in the paper is a private legal corpus called TAG, built from real requests for legal help and annotated by practicing lawyers. There, disagreement reflects the fact that legal interpretation can vary across specialists with different expertise and experience.

That legal dataset also powers the paper’s most sophisticated evaluation exercise. Rather than assume a metric is good because it looks reasonable mathematically, the authors conduct a meta-evaluation. They ask lawyers to compare the outputs of different training methods pairwise and judge which result is more accurate. From those pairwise judgments, they build a human ranking of the methods. They then ask which automatic metric produces a ranking closest to the human one.
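The logic of that meta-evaluation can be sketched as follows. This is a simplified stand-in for the paper's procedure, with hypothetical method names and pairwise judgments: win counts turn pairwise preferences into a ranking, and a hand-rolled Kendall-tau-style score measures how closely a metric's ranking agrees with the human one.

```python
from itertools import combinations

def ranking_from_pairwise(methods, wins):
    """Order methods by how often evaluators preferred their output."""
    score = {m: 0 for m in methods}
    for winner, _loser in wins:
        score[winner] += 1
    return sorted(methods, key=lambda m: -score[m])

def kendall_tau(rank_a, rank_b):
    """Rank agreement: +1 for identical rankings, -1 for reversed."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

methods = ["A", "B", "C"]
# Hypothetical judgments: lawyers preferred A over B, A over C, B over C.
human = ranking_from_pairwise(methods, [("A", "B"), ("A", "C"), ("B", "C")])
metric = ["A", "C", "B"]   # ranking induced by some automatic metric
agreement = kendall_tau(human, metric)
```

A metric whose induced ranking scores close to +1 against the human ranking is, in this sense, measuring what the experts mean by "better."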

It is a good way to ground a methodological argument in expert judgment. A metric for ambiguous labels should, in some sense, agree with humans about what “better” looks like. On this test, soft micro F1 performs well, which strengthens the case that it is capturing something real.

Meeting a new need

There is a long history of talking about disagreement in labels as if it were a contamination problem. Clean the data, align the annotators, remove the edge cases, and the truth will emerge. But many real world tasks are not like image classification benchmarks from the last decade. They involve interpretation, context, domain expertise, and sometimes genuinely plural answers. AI models built for those settings need training and evaluation regimes that acknowledge that humans do not always converge.

Once variation is treated as a signal, every design choice becomes harder. You need to decide whether to train on individual annotations, aggregated distributions, or something more structured, and you need metrics that reward appropriate overlap rather than force crisp matches.

The authors acknowledge there are limits to how far the paper’s claims can go. The meta-evaluation is built on a single private multilabel legal dataset, which constrains reproducibility and leaves open the question of whether the same result would hold as strongly in other domains. Still, the paper makes a persuasive case that AI research has spent too long acting as though label certainty were the default state of the world.

If the next generation of language systems is meant to operate in domains shaped by contested judgments, then better models alone will not be enough. We will also need better ways of representing disagreement, training on it, and scoring it. This paper suggests that the path forward may be less about inventing ever more complicated machinery than about taking human variation seriously in the first place.
