LLMs and the language of empathy: New research presented at EMNLP

Tuesday, November 26, 2024
A man chooses between digital images of a happy, sad and neutral face, showing the ability to understand and interpret emotion and empathy.

There have been countless science fiction books and films that feature moments in which machines are unable to grasp emotions that come so easily to humans, such as grief, remorse, fear or love. The general lesson is that where machines are logical and objective, humans are emotional and shaped by their own subjective experiences.

While these stories may present it as a certain kind of weakness, our ability to empathize with others and perceive how their experiences differ from our own greatly enhances how we understand and communicate. If machines could empathize in a similar way, they might understand us better as well.

A recent study by researchers at Mohamed bin Zayed University of Artificial Intelligence and Monash University examined large language models’ ability to interpret concepts like empathy, emotion, and morality in written stories and proposed ways to improve their capabilities with these complex concepts. The study was presented at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), held in Miami.

Yuxia Wang, a postdoctoral researcher in natural language processing at MBZUAI and a lead author of the study, explains that in fields like healthcare, it is essential for language models to communicate with empathy. “We of course want to design models that provide the best performance, but we also need models that are empathetic and provide mental comfort,” she says.

Measuring empathy with machines

“It’s an interesting problem: how we can make machines more empathic and how they can comprehend our emotions by using language,” says Muhammad Arslan Manzoor, a Ph.D. student in natural language processing at MBZUAI and a lead author of the study.

Large language models (LLMs), like OpenAI’s GPT series and Meta’s LLaMA, are designed to build a semantic understanding of language, which relates to their ability to interpret the relationships between words. This semantic capability is what allows LLMs to generate fluent text on a wide variety of topics and in different styles. However, language conveys other kinds of meaning as well, Manzoor explains: “Two stories can be semantically similar and empathically dissimilar at the same time.”
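To make that distinction concrete, here is a minimal sketch of how semantic similarity between two narratives is typically measured, using an off-the-shelf sentence encoder and cosine similarity. The model name and example stories are illustrative and are not drawn from the study; a high score here reflects shared topics and wording, not shared emotional resonance.

```python
# A minimal sketch: semantic similarity between two short narratives,
# computed as cosine similarity of sentence embeddings. The model name
# and example texts are illustrative, not taken from the study.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

story_a = "I lost my job last month and spent weeks doubting my own worth."
story_b = "After being laid off, I struggled to feel useful around the house."

embeddings = encoder.encode([story_a, story_b], convert_to_tensor=True)
semantic_score = util.cos_sim(embeddings[0], embeddings[1]).item()

# A high cosine score only says the stories talk about similar things;
# it does not tell us whether a reader would resonate with both in the same way.
print(f"semantic similarity: {semantic_score:.2f}")
```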

Manzoor, Wang and their colleagues’ findings build on work by a team of researchers from MIT and other institutions who created a framework for modeling empathic similarity between narratives. The authors described empathic similarity as the way people perceive similarities between themselves and others and how they resonate with these characteristics.

In the MIT study, researchers compiled a collection of short narratives in a dataset called EmpathicStories. They used an LLM to summarize the stories and label them. They also hired human workers to determine the level of similarity between pairs of stories. The workers assigned ‘similarity scores’ to story pairs, ranging from 1 to 4, with 1 being dissimilar and 4 being similar. Scores were given across four dimensions: empathy, event, emotion, and moral. Each pair of stories was rated by two annotators. These human-generated labels served as the so-called “ground truth” in the dataset. In addition, the researchers trained language models to analyze the stories from the dataset and to determine how pairs were similar.
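As a rough illustration of what such an annotation might look like in code, the sketch below defines a hypothetical record for one story pair with two annotators’ ratings across the four dimensions. The field names, example values, and the simple averaging into a gold label are assumptions for illustration, not the dataset’s actual schema.

```python
# A hypothetical record for one annotated story pair; field names and the
# averaging step are illustrative, not the EmpathicStories dataset's schema.
from dataclasses import dataclass
from statistics import mean

@dataclass
class StoryPairAnnotation:
    story_a_id: str
    story_b_id: str
    # One dict per annotator; scores run from 1 (dissimilar) to 4 (similar)
    # across the four dimensions: "empathy", "event", "emotion", "moral".
    scores: list[dict]

    def gold_label(self, dimension: str) -> float:
        """Average the annotators' ratings for one dimension."""
        return mean(s[dimension] for s in self.scores)

pair = StoryPairAnnotation(
    story_a_id="s_014",
    story_b_id="s_221",
    scores=[
        {"empathy": 3, "event": 2, "emotion": 4, "moral": 3},
        {"empathy": 2, "event": 2, "emotion": 3, "moral": 4},
    ],
)
print(pair.gold_label("empathy"))  # 2.5
```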

Manzoor and his team also tested the ability of several language models to predict similarity between stories from the EmpathicStories dataset and compared the models’ predictions to the ground truth labels.

The models the MBZUAI team tested, however, did not predict empathic similarity scores that matched the ground truth labels with high accuracy. The researchers employed different techniques — contrastive learning, LLM reasoning and fine-tuning with and without chain-of-thought — to improve the performance of the models, but these yielded gains of only 5% to 10%, with overall accuracy hovering near 40% in many cases. “We saw some improvement, but even after fine-tuning, the models got stuck,” Manzoor says.
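One of the techniques mentioned above, contrastive learning, can be sketched roughly as follows: an encoder is fine-tuned so that stories humans rated as similar are pulled together in embedding space, while dissimilar ones are pushed apart. The model name, the threshold for turning 1-to-4 ratings into contrastive labels, and the training settings below are illustrative assumptions, not the exact setup from the paper.

```python
# A hedged sketch of contrastive fine-tuning on annotated story pairs.
# Model name, threshold, and training settings are illustrative only.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Human similarity scores (1-4) reduced to binary contrastive labels:
# pairs rated 3 or above are treated as "similar" (1), the rest as "dissimilar" (0).
annotated_pairs = [
    ("story about losing a parent", "story about grieving a friend", 4),
    ("story about a missed flight", "story about losing a parent", 1),
]
train_examples = [
    InputExample(texts=[a, b], label=float(score >= 3))
    for a, b, score in annotated_pairs
]

train_loader = DataLoader(train_examples, batch_size=16, shuffle=True)
loss = losses.ContrastiveLoss(model=model)

# Pull empathically similar stories together in embedding space, push others apart.
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
```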

Challenges of subjectivity

The poor performance of the models led Manzoor and his colleagues to think that the impediment might in fact lie with the data itself and the subjective nature of interpreting concepts related to empathy.

Since the EmpathicStories dataset was annotated by humans who each have different perceptions of morality and emotions, it is extremely difficult to determine ground truth for this kind of data. One annotator might give a pair of stories a similarity score of 2 (somewhat similar), while another annotator might give the pair a score of 4 (extremely similar). “We can’t deal with this task the way it’s dealt with now, which has been to hire annotators and call their annotations gold labels,” Manzoor says. “We can’t truly call them gold labels because it depends so much on the subjective interpretation of the people doing the annotations.”

The researchers explored how individual variability between human annotators might have affected the creation of the gold labels. They hired their own group of annotators from different cultural and ethnic backgrounds to annotate story pairs from the EmpathicStories dataset. They found that when the annotators read the full stories instead of summaries, their similarity scores correlated more strongly. (In the MIT study, the annotators read machine-generated summaries.) They also found that annotators who knew each other personally showed greater alignment in their interpretations of the stories. And yet, there was still significant variability in how the annotators viewed the stories.
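Agreement of this kind is commonly quantified with a rank correlation between two annotators’ scores over the same story pairs. The brief sketch below uses made-up scores to show the calculation; it is not data from the study.

```python
# Quantifying agreement between two annotators with a rank correlation.
# The scores below are invented for illustration.
from scipy.stats import spearmanr

annotator_1 = [2, 4, 1, 3, 3, 2]  # similarity scores for six story pairs
annotator_2 = [3, 4, 2, 3, 2, 2]

correlation, p_value = spearmanr(annotator_1, annotator_2)
print(f"Spearman correlation: {correlation:.2f}")
```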

The researchers also created a new dataset of stories in Urdu written in Roman script, which is often used in South Asian countries. They did so to test the assumption that annotations from native speakers might reduce the subjectivity inherent in the task. No work had previously been done in Urdu on this topic.

The Urdu stories were written by OpenAI’s GPT-4o, which the researchers asked to generate 300 pairs of stories. Four Urdu speakers annotated the pairs following a scheme similar to the one used by the team at MIT, with scores ranging from 1 to 4 and 4 indicating the greatest similarity. Again, the researchers found that annotators who knew each other gave similar scores, which led them to believe that machines’ ability to interpret empathy will be aided by annotators who have a deep understanding of local culture and norms.
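A generation step like this might look roughly like the sketch below, which asks GPT-4o for a Roman-Urdu story pair through the OpenAI API. The prompt wording is invented for illustration and is not the authors’ actual prompt.

```python
# An illustrative sketch of generating a Roman-Urdu story pair with GPT-4o.
# The prompt text is invented; it is not the prompt used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Write two short first-person stories in Roman Urdu, each 4-5 sentences, "
    "about everyday emotional experiences. Label them Story A and Story B."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```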

Empathy across cultures

We live in an age of massive language models in which some systems have been trained on more than 100 languages. While the hope of developers may be that one model can be used by speakers of many different languages, Manzoor, Wang and their colleagues’ research shows that the concept of empathy is highly specific and greatly influenced by cultural context. Moreover, concepts related to empathy are difficult not only for models to interpret but also for humans to agree on.

“We cannot say that LLMs can be empathic at the same time for all people,” Manzoor says. “People from all different backgrounds are using these machines and they might not fully understand how these machines can influence them. We want to make them more intelligent in terms of empathy and maintain the empathic standards for the target culture.”

“To make models more empathetic, and to communicate the way humans do in different situations, is my next step,” Wang adds.
