Large language models, like OpenAI’s GPT-4 and Meta’s LLaMA, are often trained on data from many different languages, sometimes dozens. Indeed, when queried during the writing of this story, ChatGPT 4 said that its training data included more than 100 languages, while Meta AI said that LLaMA had been trained on more than 50.
But just because a model was trained on data from a particular language doesn’t mean the model is competent in that language. To gain an understanding of models’ capabilities, scientists develop what are known as benchmarking datasets, which serve as standardized tests for evaluating performance. With benchmarking, scientists can compare the performance of different models, identify their respective strengths and weaknesses, and make informed decisions about how they can be improved. And by evaluating a model over time on the same benchmark, researchers can observe whether, and how, it is improving.
There is a significant need for benchmarks in languages other than English, said Fajri Koto, assistant professor of natural language processing at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). “Because of these missing datasets, people haven’t been able to properly evaluate the performance of these models in languages beyond English.”
Koto and researchers from MBZUAI and other institutions have recently compiled the first benchmark dataset in Modern Standard Arabic that evaluates language understanding across many tasks. The study is being presented at the 62nd Annual Meeting of the Association for Computational Linguistics, which is being held this week in Bangkok.
Previously, the only datasets that were available to assess LLM knowledge and reasoning capabilities in Arabic had been translated from English. This limits their effectiveness, as translation introduces errors and misses the cultural contexts that are specific to Arabic-speaking regions. “English has been extensively used to evaluate the biggest LLMs,” Koto said. “But as these datasets are English-centric, the cultural context is more like an American one” that isn’t relevant for use in the Arab world.
Koto and his coauthors call their dataset ArabicMMLU, as it’s based on an approach called massive multitask language understanding that was conceived by a team of researchers at the University of California, Berkeley and was designed to test models’ ability to answer multiple-choice questions in a variety of subjects.
ArabicMMLU contains more than 14,000 multiple-choice questions gathered from school exams across the Arabic-speaking world. Koto and his colleagues compiled the dataset with the help of native Arabic speakers from Egypt, Jordan, Lebanon, Saudi Arabia and the UAE. More than half the questions relate to what the researchers call “Arabic-specific contexts.”
ArabicMMLU evaluates two aspects of LLMs. The first is knowledge, a measure of what a model has learned and memorized during training; for example, knowing that Abu Dhabi is the capital of the United Arab Emirates. The second is reasoning, which relates to a model’s ability to generate new understanding based on factual knowledge. An example of a reasoning question from the study is: “The number of hundreds and tens in the number 700 is?”
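The arithmetic behind that example shows why it calls for derivation rather than recall. Here is a minimal sketch of the calculation a model has to carry out (purely illustrative, not code from the study):

```python
# Purely illustrative: the arithmetic behind the example question above.
n = 700
hundreds = n // 100   # 700 contains 7 hundreds
tens = n // 10        # 700 contains 70 tens
print(hundreds, tens) # 7 70
```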
“The original purpose of the multiple-choice questions that are included in ArabicMMLU was to test students’ reasoning ability about certain kinds of knowledge,” Koto said. “We want to know if a model can develop new conclusions” when it encounters the questions in the dataset.
When testing a model’s reasoning capabilities, researchers try to determine if models answer questions correctly because they are reasoning, or simply because they have seen the same questions previously, Koto said. This determination, however, is complicated by the fact that the biggest LLMs have been trained with huge amounts of data, and it’s difficult to know if a model has been exposed to a particular question in the past.
In the study, the researchers evaluated 35 language models: 22 open-source multilingual models, 11 open-source Arabic models and two closed-source Arabic models.
On zero-shot question answering, meaning the models were given no solved example questions in the prompt before responding, the researchers found that GPT-4 beat out all other models, answering 72.5% of questions correctly across subjects. That said, the team is unsure whether GPT-4’s strong performance was due to superior reasoning capabilities, or simply because it had been exposed to the data before and memorized it.
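In practice, zero-shot accuracy on a multiple-choice benchmark like this is typically computed by prompting the model with each question and its options, with no worked examples in the prompt, and counting how often it picks the correct option. Here is a minimal sketch of that procedure, where `ask_model` is a hypothetical stand-in for a call to any LLM and the record format is illustrative rather than ArabicMMLU’s actual schema:

```python
# Minimal zero-shot multiple-choice evaluation sketch (illustrative only).

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; here it always guesses 'A'."""
    return "A"

def zero_shot_accuracy(questions: list[dict]) -> float:
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", q["choices"]))
        # Zero-shot: the prompt contains only the question and its options, no worked examples.
        prompt = f"{q['question']}\n{options}\nAnswer with the letter of the correct option."
        if ask_model(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)

# Illustrative record, not taken from the dataset itself.
sample = [{
    "question": "What is the capital of the United Arab Emirates?",
    "choices": ["Dubai", "Abu Dhabi", "Sharjah", "Al Ain"],
    "answer": "B",  # Abu Dhabi
}]
print(zero_shot_accuracy(sample))  # 0.0 with the always-'A' placeholder
```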
Jais, an Arabic LLM developed by MBZUAI and Inception, was the top-performing open-source model, answering 62.3% of questions correctly. What’s more, Jais outperformed GPT-3.5, which is notable given that Jais is a much smaller model.
Yet overall, the study showed that open-source LLMs performed poorly on the evaluation, particularly the multilingual open-source models. Even models designed specifically for Arabic struggled with questions related to cultural knowledge.
The researchers also explored how models performed on data from specific countries. This kind of analysis could be helpful to developers building LLMs for use in a specific country or group of countries. “ArabicMMLU can be a proxy for developers who want to know how well a model would perform in a particular region,” Koto said.
While ArabicMMLU is a significant advancement in the evaluation of Arabic language models, there is still a major need for high-quality, non-translation datasets. “Progress in languages other than English has not advanced as quickly as it needs to,” Koto said.