Measuring cultural commonsense in the Arabic-speaking world with a new benchmark

Wednesday, July 30, 2025

Commonsense reasoning describes the ability of an AI system, such as a language model, to grasp concepts that come naturally to humans. It’s a challenge that researchers have long tried to understand and address. For example, a few years ago, text-to-image generators were notorious for producing pictures of people with more than five fingers on each hand. People intuitively understand the connections between concepts like hunger and food, or fatigue and sleep, but these relationships don’t come so easily to machines.

In the field of AI, commonsense reasoning has often been treated as universal to the human experience, but a new study by researchers at MBZUAI and other institutions shows how the commonsense reasoning capabilities of language models vary across the cultures of the Arabic-speaking world.

“AI systems are moving towards personalization and if I want a model to respond to me as an individual and as an Egyptian, I want the model to capture the unique nuances of my culture,” explains Abdelrahman Sadallah, a graduate of the master’s program in Natural Language Processing at MBZUAI and co-author of the study.

Sadallah and his co-authors compiled a new benchmark dataset that measures commonsense reasoning capabilities across the diverse cultures of the Arab world. The dataset, called ArabCulture, is the largest of its kind and was built by native Arabic speakers. The researchers tested 31 language models on ArabCulture and found that many of the systems came up short when it comes to understanding cultural concepts across the region.

The researchers are presenting their findings at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) in Vienna. Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, and Fajri Koto are co-authors of the study.

Building a new and unique dataset in Arabic

In their study, the researchers explain that cultural diversity influences not only the social interactions among people, but also the ways in which they reason and think about the world. Most benchmarks that are used to test cultural understanding don’t capture these nuances.

This motivated Sadallah and his colleagues to create a new dataset that is focused specifically on cultural concepts in the Arab world. They hired workers from 13 Arabic-speaking countries spanning from North Africa to the Gulf who all possess a deep understanding of their respective cultures.

A new dataset built from scratch by native Arabic speakers covers four regions (North Africa, the Nile Valley, the Levant, and the Gulf) and 13 countries.

Written in modern standard Arabic (MSA), the dataset “is built from scratch and comes from the minds of the workers,” Sadallah says. Building a dataset from scratch — and keeping it off the internet — minimizes the possibility of what’s known as data leakage, where a model is tested on a dataset that it’s already seen, making its performance look better than it really is.

It’s possible to translate a dataset from one language to another and ‘localize’ it, which involves changing the names of people and things into ones that are common in the culture of interest. But cultural context is still lost in the process.

Koto, Assistant Professor of Natural Language Processing at MBZUAI and co-author of the study, provides an example: “Is it ok for you to sit on the floor and eat with your hands at a wedding? This may be ok in some cultures, while it wouldn’t be in others.”

The questions in ArabCulture — nearly 3,500 of them — are sentence completion tasks. Each is made up of a one-sentence premise and three answer choices, all of which make sense in terms of logic and syntax. But only one of the three answers is correct, because it’s the only one that is appropriate within the cultural context. The dataset covers 12 topics relevant to daily life, such as food, weddings, family relationships, and agriculture. These are broken down into 54 subtopics that are culturally relevant to the Arab world, including iftar and burial rituals.
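Based on that description, a single ArabCulture-style item might look like the sketch below. This is purely illustrative: the field names, example text, and validation helper are assumptions, not the dataset's actual schema (and the real items are written in MSA, not English).

```python
# Hypothetical sketch of one ArabCulture-style item. Field names and
# content are assumptions for illustration; the real schema may differ.
item = {
    "country": "Jordan",
    "region": "the Levant",
    "topic": "food",
    "subtopic": "iftar",
    "premise": "At sunset during Ramadan, the family gathers at the table",
    "choices": [
        "and breaks the fast with dates and water.",   # culturally appropriate
        "and serves a large breakfast of pancakes.",   # fluent but culturally off
        "and begins the day's fast together.",         # fluent but culturally off
    ],
    "answer": 0,  # index of the single culturally correct completion
}

def is_valid(item: dict) -> bool:
    """Check that an item has exactly three choices and a valid answer index."""
    return len(item["choices"]) == 3 and item["answer"] in range(3)

print(is_valid(item))  # True
```

The key property, as the article notes, is that all three completions are grammatical and logical continuations of the premise; only cultural knowledge distinguishes the correct one.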

Examples of multiple-choice questions from the ‘lunch’ category in the ArabCulture dataset. Language models are given a sentence and asked to answer by picking the sentence that should follow it. While all three answers are logically and syntactically correct, only one is appropriate when considering cultural context.

What do language models know about Arab culture?

The researchers tested multilingual models, Arabic-centric models, and both open- and closed-weight models. While the questions were in Arabic, the researchers compared the models’ performance using Arabic prompts and English prompts. They also tested the models under two different strategies: sentence completion and multiple choice.
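A common way to implement these two strategies is to score each candidate completion by its likelihood under the model (sentence completion), or to present all three options in one prompt and parse the model's choice (multiple choice). Below is a minimal sketch of the likelihood-based strategy. The `completion_log_prob` function is a stand-in for a real model's scoring call, not the study's actual evaluation code.

```python
def completion_log_prob(premise: str, completion: str) -> float:
    """Stand-in for a real model's log P(completion | premise).
    A real evaluation would query a language model; here a toy keyword
    heuristic is used purely so the sketch runs end to end."""
    return 1.0 if "dates" in completion else 0.0

def pick_completion(premise: str, choices: list[str], score_fn=completion_log_prob) -> int:
    """Sentence-completion strategy: return the index of the
    highest-scoring candidate continuation."""
    scores = [score_fn(premise, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

choices = [
    "and breaks the fast with dates and water.",
    "and serves a large breakfast of pancakes.",
    "and begins the day's fast together.",
]
print(pick_completion("At sunset during Ramadan, the family gathers", choices))  # 0
```

Accuracy over the benchmark would then be the fraction of items where the chosen index matches the annotated answer.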

GPT-4o, a closed-weight model, performed better than the open-weight models. (Since it’s closed-weight, it could only be tested in the multiple-choice setting.) When GPT-4o was given the region and the country that the question related to, it answered 90% of them correctly. The next-best performing models were Qwen2.5-Instruct at 80% and LLaMA-3.3-Instruct closely behind at 79.6%.

The researchers found, perhaps surprisingly, that multilingual models performed better than Arabic-specific models. There was no correlation between model size and performance, which suggests that other factors, such as the pretraining data and model architecture, influenced the results.

Some models performed better in certain areas than in others. For example, GPT-4o performed best on agriculture and family relationships, AceGPT did best on topics related to death and idioms, and Qwen excelled at agriculture and traditional games.

Performance was influenced by geography as well. Questions from Jordan were answered with high accuracy (90%) across models, while performance was lower for questions about Lebanese and Tunisian cultures.

The models performed best when they were prompted in English. Though this behavior has been reported in other studies, Sadallah said he was interested to see it occur in this setting.

Building more culturally aligned models

Overall, the researchers say that their findings illustrate the need to improve models’ understanding of Arabic cultural contexts.

But how can this be done?

Tonga, a research assistant at MBZUAI and co-author, says that it might be possible to improve performance by using a larger model to provide cultural context to a smaller model. There’s evidence for this approach: the performance of Qwen and LLaMA increased when they received more context about the cultures in question.

Koto says another approach would be to use what’s known as a preference dataset during post-training to improve a model’s cultural alignment.

However it’s done, Tonga notes that developing AI systems that are aligned with the world’s diverse languages and cultures must be a priority: “Since AI should be used by everyone, AI needs to understand everyone’s cultures.”

And while ArabCulture is written in MSA, which is used in media, government, and other official communications, Sadallah believes a future direction may be to build another dataset (or datasets) covering the many dialects spoken throughout the Arab world, which could be used to evaluate models at an even greater level of granularity.
