As an area of study, culture has traditionally been the quarry of researchers in the humanities and social sciences. But the recent rise of large language models and their daily use by millions of people across the globe has raised important questions about how these systems represent the world’s cultures.
Researchers from MBZUAI are authors of two new studies that consider how LLMs navigate cultural difference. The papers were recently presented at the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). One of the studies, which focuses on a concept known as meta-cultural competence, won a Senior Area Chair (SAC) Award for the conference’s special theme.
A study titled “Reading between the lines: Can LLMs identify cross-cultural communication gaps?” explores the potential for LLMs to help people understand concepts from different cultures. In the study, experiments measure the ability of GPT-4o, a model developed by OpenAI, to identify cultural references in book reviews from the website Goodreads.
The authors of the study are postdoctoral researcher Sougata Saha, research associate Saurabh Kumar Pandey, and professor of natural language processing Monojit Choudhury, all of MBZUAI. Harshit Gupta of IIIT Hyderabad is also an author.
The team chose Goodreads as a source for several reasons. Books are cultural artifacts, and reviews often contain rich cultural references. Goodreads reviews also follow a generally standardized format, making them useful for analysis.
To begin, the researchers had evaluators from India, Mexico, and the US read reviews of books from Ethiopia, India, and the US, and asked them to identify sections of text in the reviews that could be difficult to understand. The evaluators were then asked to indicate which of these snippets contained cultural concepts, called culture-specific items (CSIs) in the study. This allowed the team to build a benchmark dataset for measuring how well models identify these cultural concepts.
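To make the setup concrete, here is a hypothetical sketch, in Python, of what a single annotated record in such a benchmark might look like; the field names and values are illustrative assumptions, not the paper’s actual schema.

```python
# A hypothetical benchmark record; field names and values are
# illustrative assumptions, not the paper's actual schema.
record = {
    "review_id": "goodreads-0001",
    "book_culture": "US",          # culture of the reviewed book
    "annotator_culture": "India",  # culture of the human evaluator
    "snippet": "The ending was a total home run.",
    "hard_to_understand": True,    # flagged as difficult by the evaluator
    "is_csi": True,                # contains a culture-specific item
}
```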
Fifty evaluators participated in the study and annotated 57 book reviews. According to the evaluators, 83% of the reviews contained at least one CSI. CSIs can be figures of speech or common metaphors. For example, the term “home run” is often used in American English to describe something that is done well, while “topper” is used in India to describe a student at the top of their class. Both can be hard to understand for people from other cultures.
“We wanted to see if there were common themes that people didn’t understand across cultures,” Saha says. “Aggregation of these points of confusion allows us to see the bigger picture related to understandability across cultures.”
Previous studies have considered the implications of cross-cultural communication gaps, but this was the first to explore how LLMs can be used to identify gaps and aid in cross-cultural communication, Pandey says.
The human evaluations of the reviews formed what is known as a gold dataset. The researchers compared these findings to the performance of an LLM. To do this, they used a technique called sociodemographic prompting, where they asked GPT-4o to take on personas of people from the same countries that the evaluators came from (India, Mexico, and the US). Based on these personas, the researchers asked the model to categorize the difficult passages as either CSIs or non-CSIs.
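As a rough illustration of how sociodemographic prompting can be implemented, the sketch below queries GPT-4o through the OpenAI Python SDK; the persona wording, prompt text, and label set are assumptions for illustration, not the authors’ exact prompts.

```python
# A minimal sketch of sociodemographic prompting, assuming the OpenAI
# Python SDK (v1.x); persona and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONAS = {
    "India": "You are a reader who was born and raised in India.",
    "Mexico": "You are a reader who was born and raised in Mexico.",
    "US": "You are a reader who was born and raised in the United States.",
}

def classify_snippet(snippet: str, country: str) -> str:
    """Ask GPT-4o, role-playing a persona, to label a snippet as CSI or non-CSI."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the labeling as deterministic as possible
        messages=[
            {"role": "system", "content": PERSONAS[country]},
            {
                "role": "user",
                "content": (
                    "Does the following book-review snippet contain a "
                    "culture-specific item (CSI) that could be hard for "
                    "someone from your culture to understand? "
                    "Answer with exactly one label: CSI or non-CSI.\n\n"
                    f"Snippet: {snippet}"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip()

print(classify_snippet("The ending was a total home run.", "India"))
```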
The model didn’t perform well, achieving a precision score of 49% and a recall score of 65%. Not only did the model miss many CSIs, but it also produced a significant number of false positives, labeling items as culturally specific when in fact they were not. That said, the system performed similarly across cultures, which Saha found surprising, as previous research has shown that LLMs are Western-centric. “We found that they do equally bad across cultures,” he says.
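For readers less familiar with these metrics, the short sketch below shows how precision and recall are computed; the counts are invented so that they reproduce the reported percentages and are not figures from the paper.

```python
# Illustrative precision/recall arithmetic; the counts below are
# invented to match the reported 49% precision and 65% recall.
true_positives = 49   # snippets correctly labeled as CSIs
false_positives = 51  # non-CSIs the model wrongly labeled as CSIs
false_negatives = 26  # real CSIs the model missed

precision = true_positives / (true_positives + false_positives)  # 49 / 100 = 0.49
recall = true_positives / (true_positives + false_negatives)     # 49 / 75 ≈ 0.65

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```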
The findings indicate that CSIs are a common set of items that are distinct across cultures, Pandey says. “Our findings can help us develop tools and applications that can facilitate cross-cultural communication,” he adds.
They’ve already built one, called Culturally Yours, which is designed to help users look up cultural references on webpages. There’s the potential to build other tools, like translation systems and recommendation engines, that benefit from these insights, Pandey says.
Even though the model performed equitably across cultures, the researchers note that there is much room for improvement.
Saha, Pandey, and Choudhury are also the authors of an award-winning position paper presented at NAACL called “Meta-cultural competence: Climbing the right hill of cultural awareness,” which proposes a new way to build culturally competent models. Both studies were supported by an Accelerating Foundation Models Research grant from Microsoft Research.
Natural language processing researchers have long argued for the need for models that are culturally competent, Saha explains, but few studies have proposed concrete ways of making this happen. Some think that developers should incorporate facts related to all cultures into their models. “But if you think about it, people don’t know everything about all cultures, but we are somehow culturally competent,” Saha says.
The researchers argue that instead of trying to teach LLMs about specific cultural facts or norms, developers should focus on building models that display what is known as meta-cultural competence. This would allow models to adapt to unfamiliar cultures, even when the models haven’t seen examples from these cultures before.
Meta-cultural competence includes two key capabilities: variational awareness, the ability to recognize that cultural differences exist and to make sense of those differences; and explication and negotiation, the ability to ask clarifying questions or explain uncertainty when cultural understanding is incomplete.
In the study, the researchers characterize today’s approaches, such as fine-tuning, as stopgap measures that don’t scale to the huge diversity of global cultural variation. “There are so many things that a model should be mindful of because there are so many different types of people in the world,” Saha says.
Instead, they argue for developing AI systems that can detect differences in cultures and adapt dynamically, much like a person would when navigating an unfamiliar environment. “We propose a new framework for engendering culturally aware AI systems and present lots of open questions for the community to pursue,” he adds.
As the team continues this line of work, the goal is not just to make AI more factually informed, but more culturally sensitive and adaptive. “I’m always interested in the question of who we are, where we came from, and where we are going,” Saha says. “I am always trying to understand these questions better, and if LLMs can help, they are worth exploring.”