Tutors of tomorrow? A new benchmark for evaluating LLMs

Monday, May 05, 2025

It’s been known for decades that learners who receive one-on-one tutoring show significantly better educational outcomes than those who only receive instruction in the classroom. These findings, published by educational psychologist Benjamin Bloom in the 1980s, are known as the “two sigma problem,” because students who were tutored performed two standard deviations above those who learned from standard classroom instruction alone.

And yet, it hasn’t been possible to provide a large proportion of students with one-on-one support because there simply aren’t enough trained tutors to go around.

The advent of large language models (LLMs), however, has the potential to change this, making personalized tutoring much more widely available.

A team from MBZUAI has taken an important step in this effort by developing an evaluation framework and benchmark that can be used to measure the teaching abilities of LLMs. The researchers were recently recognized for their work at the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), held in Albuquerque, New Mexico, picking up the SAC Award for Resources and Evaluation.

“LLMs are like huge knowledge bases that have the potential to be used for tutoring,” says Kaushal Kumar Maurya, a postdoctoral research associate at MBZUAI and lead author of the study. “We’re trying to figure out how the raw knowledge contained in these systems can be useful in an educational setting.”

Past and present of machine tutoring

For years, scientists have built what they call intelligent tutoring systems to help learners. Early systems were designed to provide feedback to learners according to programmed rules. These systems were precise, and in certain situations they were effective, but they weren’t fluent enough to handle the wide variety of questions learners tend to have, explains Ekaterina Kochmar, assistant professor of Natural Language Processing at MBZUAI and coauthor of the study.

LLMs work differently and don’t provide responses according to pre-programmed rules. They generate their outputs according to patterns they have identified in their training data and are significantly more fluent than the rule-based systems of the past. But LLMs “aren’t great for teaching because they don’t have pedagogical principles embedded in them,” Kochmar explains.

It may be possible, however, to integrate principles from learning science into LLMs so that they can serve learners better. But first, AI researchers need to develop an understanding of best practices in tutoring and develop benchmarks to evaluate today’s LLMs according to these principles, Kochmar says. Only then will researchers be able to identify the strengths and weaknesses of LLMs when it comes to helping students.

The potential to help students is enormous: “With AI, we can provide everyone with a personalized tutor, something like a personal digital assistant on a phone,” Kochmar says. “This wouldn’t be a replacement for what we have in the classroom but would complement classroom learning.”

Insights from learning science

Other researchers have developed criteria to measure the performance of LLMs as tutors, but these evaluations have focused on isolated aspects of performance, making it difficult to know how, or whether, these systems are improving.

Kochmar, Maurya, and their colleagues therefore set out to create a holistic evaluation framework that can be used to measure and track the educational performance of LLMs. It is the first unified taxonomy based on learning science principles, and it focuses on eight key dimensions identified in previous research on learning in mistake remediation scenarios: mistake identification, mistake location, revealing of the answer, providing guidance, actionability, coherence, tone, and human likeness.
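
To make the taxonomy concrete, here is a minimal sketch of how a single annotation across the eight dimensions could be represented in code. The field names mirror the dimensions above, but the three-level label set is an illustrative assumption, not the study’s exact annotation scheme.

```python
from dataclasses import dataclass

# Illustrative three-level label set; the study's actual labels may differ.
LABELS = ("yes", "to_some_extent", "no")

@dataclass
class PedagogicalAnnotation:
    """One annotator's judgment of a single tutor response."""
    mistake_identification: str   # did the tutor notice the student's mistake?
    mistake_location: str         # did it point to where the mistake occurred?
    revealing_of_the_answer: str  # did it reveal the final answer outright?
    providing_guidance: str       # did it offer hints or scaffolding?
    actionability: str            # can the student act on the feedback?
    coherence: str                # is the response logically consistent?
    tone: str                     # is the tone encouraging and appropriate?
    human_likeness: str           # does it read like a human tutor?

    def __post_init__(self):
        for name, value in self.__dict__.items():
            if value not in LABELS:
                raise ValueError(f"{name} must be one of {LABELS}, got {value!r}")
```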

For example, a good tutor doesn’t simply give away the correct answer when a student makes a mistake. In that case, the student wouldn’t learn much.

Instead, a good tutor will try to guide the learner to figure out the solution on their own. Unfortunately, today’s systems don’t do a great job. A previous study found that OpenAI’s GPT-3 revealed solutions to problems 66% of the time and provided incorrect feedback 59% of the time.

A new comprehensive benchmark

To evaluate how well different LLMs performed across these eight dimensions, the researchers created a new benchmark, called MRBench, which compiled 192 tutoring conversations focused on student mistakes in math.

The conversations in MRBench came from two existing datasets, Bridge and MathDial. Bridge is composed of interactions between learners and novice and expert tutors. MathDial is composed of conversations between human tutors and LLMs playing the role of learners. The last line of dialogue in each of these conversations included a wrong answer or some point of confusion on the part of the learner.
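
As a rough illustration, a single benchmark item can be thought of as a short dialogue history whose final learner turn contains the mistake, together with the candidate tutor responses to be annotated. The structure and the math example below are a hypothetical sketch, not MRBench’s actual format or content.

```python
# Hypothetical sketch of an MRBench-style item; field names and dialogue are illustrative.
example_item = {
    "source": "MathDial",  # or "Bridge"
    "conversation": [
        {"speaker": "tutor", "text": "What is 3/4 of 20?"},
        # The final learner turn contains the mistake: only one quarter was taken.
        {"speaker": "student", "text": "It's 5, because 20 divided by 4 is 5."},
    ],
    "candidate_responses": {
        "GPT-4": "You found one quarter of 20. How many quarters does 3/4 ask for?",
        "Expert": "Good start dividing by 4. Now, what do you do with the 3 in 3/4?",
        # ... one response per LLM and per human tutor
    },
}
```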

The researchers prompted seven LLMs to take on the role of expert tutors and asked them to provide an appropriate response to the last line of dialogue. The LLMs included lightweight models like LLaMA-3.1-8B and cutting-edge systems like GPT-4. Responses from human novice and expert tutors were also included in the dataset.
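
As an example of this step, the snippet below prompts a model to produce the next tutor turn using the OpenAI Python client and GPT-4. The system prompt wording is an assumption rather than the one used in the study, and open-weight models such as LLaMA-3.1-8B would be queried through their own inference stacks instead.

```python
from openai import OpenAI  # assumes the openai package is installed and an API key is configured

client = OpenAI()

# Illustrative system prompt; the study's exact instructions are not reproduced here.
SYSTEM_PROMPT = (
    "You are an expert math tutor. The student's last message contains a mistake "
    "or a point of confusion. Reply with the next tutor turn only."
)

def tutor_turn(dialogue_history: str, model: str = "gpt-4") -> str:
    """Ask the model to respond as an expert tutor to the given dialogue history."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": dialogue_history},
        ],
    )
    return response.choices[0].message.content
```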

Once the models produced their answers, each answer was evaluated by trained human annotators who were asked to apply the eight-dimension framework to the responses. Annotators treated each evaluation category independently; for example, they determined whether the tutor identified a student’s mistake separately from whether the tutor gave helpful guidance. This ensured that MRBench provided not just a collection of tutoring conversations, but a thoughtfully evaluated dataset that can be used to track and compare the pedagogical abilities of LLMs over time.

In addition to human annotators, the researchers had two LLMs — Prometheus2 and Llama-3.1-8B — evaluate the responses.

How LLMs performed as tutors

Overall, the study found that while some LLMs produced responses that sounded fluent, even human-like, they often fell short when evaluated against deeper pedagogical standards. “When we were annotating, we weren’t privy to whether a response was from a model or a human, and there were cases where I couldn’t tell,” says KV Aditya Srivatsa, a research associate at MBZUAI and coauthor of the study. “But there is still a lot of room for improvement.”

The best-performing LLMs, such as OpenAI’s GPT-4 and Meta’s LLaMA-3.1-405B, were generally effective at identifying when and where students made mistakes. But these systems frequently revealed the answer too soon, rather than guiding the student toward discovering it, as was found in previous studies.

Anthropic’s Sonnet produced coherent and encouraging replies but was less consistent when it came to offering guidance. Smaller models, like LLaMA-3.1-8B, performed reasonably well considering their size, but another small model, Phi3, developed by Microsoft, struggled across most dimensions, and was unable to provide good guidance in most cases.

Expert human tutors were the best at providing actionable guidance, but novice tutors struggled across many dimensions.

Evaluation of novice and expert human tutors and seven LLMs across eight pedagogical dimensions shows that most LLMs struggle with key aspects of tutoring related to math problems. Scores are calculated according to a metric called desired annotation match rate (DAMR), which quantifies the percentage of responses from each human or LLM-based tutor that received the desired annotation labels.
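
Based on that description, DAMR for one dimension is simply the share of a tutor’s responses whose annotation matches the desired label for that dimension. The sketch below illustrates the calculation; the mapping from dimensions to desired labels is an assumption for illustration (for instance, “no” is treated as desirable for revealing the answer, since a good tutor should not give it away).

```python
# Sketch of the desired annotation match rate (DAMR). The desired-label mapping
# below is an illustrative assumption, not necessarily the paper's exact choice.
DESIRED_LABEL = {
    "mistake_identification": "yes",
    "mistake_location": "yes",
    "revealing_of_the_answer": "no",  # a good tutor should not give the answer away
    "providing_guidance": "yes",
    "actionability": "yes",
    "coherence": "yes",
    "tone": "yes",
    "human_likeness": "yes",
}

def damr(annotations: list[dict], dimension: str) -> float:
    """Percentage of annotated responses whose label for `dimension` is the desired one."""
    matches = sum(1 for a in annotations if a[dimension] == DESIRED_LABEL[dimension])
    return 100.0 * matches / len(annotations)

# Toy example: DAMR for "providing_guidance" over three annotated responses.
sample = [
    {"providing_guidance": "yes"},
    {"providing_guidance": "no"},
    {"providing_guidance": "yes"},
]
print(f"{damr(sample, 'providing_guidance'):.1f}%")  # 66.7%
```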


“The results are not just numbers,” Kochmar says. “They’re insightful because they show us the comparative strengths and weaknesses of different models.”

When the team compared evaluations of responses by human annotators with evaluations done by LLMs, they found that the LLMs often produced unreliable assessments. This finding suggests that, for now, human evaluation remains necessary to accurately assess the teaching abilities of AI tutors.
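
One simple way to quantify how closely an LLM judge tracks human annotators on a given dimension is chance-corrected agreement. The sketch below uses Cohen’s kappa from scikit-learn purely as an illustration; the labels are made up, and this is not necessarily the agreement measure used in the study.

```python
from sklearn.metrics import cohen_kappa_score

# Made-up labels for the same ten responses on one dimension.
human_labels = ["yes", "no", "yes", "to_some_extent", "yes", "no", "yes", "yes", "no", "yes"]
judge_labels = ["yes", "yes", "yes", "no", "yes", "no", "to_some_extent", "yes", "no", "yes"]

# Cohen's kappa corrects raw agreement for agreement expected by chance:
# values near 0 suggest an unreliable judge, values near 1 near-perfect agreement.
kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Human vs. LLM-judge agreement (Cohen's kappa): {kappa:.2f}")
```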

Teaching AI to teach better

The creation of MRBench is only the beginning of the team’s work on this project, which has recently received support from Google through a Google Academic Research Award (GARA). Next, the team plans to modify LLMs so that they produce not only fluent responses but also ones that align with the eight dimensions adopted from learning science.

To do this, the data used to train models could be made more pedagogically rich, Maurya says. Fine-tuning systems on high-quality data related to learning would likely help as well.

Kseniia Petukhova, a master’s student in Natural Language Processing at MBZUAI and coauthor of the study, hopes that in the future systems can incorporate additional dimensions related to personalization and provide individualized feedback to learners.

MRBench has already gained the attention of other AI researchers: a shared task organized by MBZUAI researchers on the basis of the dataset has attracted more than 50 teams from around the world. Teams at other universities are also expanding the dataset and applying the pedagogical framework to tasks beyond mistake remediation in math.

While LLMs aren’t capable of replacing human tutors just yet, this research moves the field closer to solving Bloom’s two sigma problem and a future where every student has access to personalized, high-quality tutoring.
