How dialectal pretraining improves Arabic automatic speech recognition

Wednesday, August 06, 2025

More than 400 million people speak Arabic, making it one of the most widely spoken languages in the world. Modern Standard Arabic (MSA) is a standardized version of the language used in media, government, universities, and other institutions across more than 20 countries where Arabic is an official language. But the dialectal Arabic that people speak in their daily lives differs significantly from MSA and varies greatly from one region to the next.

Researchers have developed automatic speech recognition (ASR) systems designed specifically to interpret Arabic, but the great diversity of dialects has posed a challenge for these systems. An added challenge is that speakers, depending on where they are, might pepper their conversations with words from other languages, like English, French, and Spanish, a practice known as code-switching. This multilingual aspect makes building ASR systems for Arabic even more difficult.

Researchers from MBZUAI have been working to address this challenge and have developed a suite of ASR models that are designed to process MSA and dialectal Arabic in the presence of code switching. The team presented their findings at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) in Vienna.

Why Arabic dialects are hard for automatic speech recognition

Arabic dialects are more than regional variations on the standardized version. Some are so different that they could be considered distinct languages, says Amirbek Djanibekov, a Ph.D. student in Natural Language Processing at MBZUAI and co-author of the study.

For the most part, dialects are spoken and not written. When they are written, there is no standardized form, which makes training AI models difficult. The amount of data available to train models also varies significantly across dialects, and correctly labeling them is a challenge, explains Hawau Olamide Toyin, also a Ph.D. student in Natural Language Processing at MBZUAI and co-author of the study. Dialects are typically labeled by geographic region or country, but these labeling schemes aren’t always adequate. “In the UAE itself, there are variations in dialect from one emirate to another,” she says.

Raghad Alshalan, Abdullah Alitr, and Hanan Aldarmaki are co-authors of the study.  

Improving dialectal performance

Previous studies showed that training an ASR system on MSA would give it good performance on that variant of the language. But Djanibekov, Toyin, and their co-authors wanted to develop a model that could interpret both MSA and dialectal Arabic.

The researchers collected data on several variants of Arabic, including MSA, classical Arabic, and dialects from North Africa, the Levant, and the Gulf.

They started with an ASR model called Arabic Speech and Text Transformer (ArTST) and trained three versions of the model: one only on MSA data, a second on MSA and dialectal data, and a third on MSA, dialectal, and multilingual data.

They tested the MSA and MSA-dialect pretrained versions under several different scenarios to determine the impact that dialectal pretraining had on performance on both MSA and dialects.

To test performance on MSA, they fine-tuned the two versions on an MSA dataset called MGB2 and compared their performance to state-of-the-art models. Their MSA-dialect model had the lowest word error rate, illustrating that dialectal pretraining didn’t hurt performance on MSA.
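Word error rate, the metric used throughout these comparisons, is the word-level edit distance between a model's transcript and the reference, divided by the number of reference words. A minimal sketch, assuming simple whitespace tokenization (the function name is illustrative, not from the study's code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
```

In practice, Arabic ASR evaluation usually normalizes the text first (for example, handling diacritics) before computing this score.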

To test on dialects, they fine-tuned the two versions on an Egyptian dataset called MGB3 and compared them to other models. Again, theirs were the best-performing models, with the MSA-dialect pretrained version beating the MSA-only pretrained version by 4%. It’s the best performance on the benchmark to date. Tests on a Moroccan dataset showed smaller gains.

They also tested the two versions on dialect-specific benchmarks with and without dialectal finetuning and found that on average dialectal pretraining improved performance across these tests.

Impact of fine-tuning on dialects

In all these tests, the models were fine-tuned on individual dialects and tested on those specific dialects. Would fine-tuning on several dialects make the model even better?

To find out, the researchers fine-tuned the model across 12 dialects and developed other versions that used a dialect identifier, or dialect ID. In one setting, the system was given a dialect label and generated tokens conditioned on that label. In another, the system still generated tokens according to a dialect, but it was left to infer which dialect was most appropriate.
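The two decoding modes above can be sketched as follows. This is a conceptual illustration only, not the ArTST interface: the `ToyASRModel` class, the token names, and `decode_with_dialect` are all assumptions made for the example.

```python
# Illustrative sketch of dialect-ID conditioning in a seq2seq ASR decoder.
# All names here are hypothetical, not the study's actual API.

DIALECT_TOKENS = {"egyptian": "<EGY>", "gulf": "<GLF>", "moroccan": "<MOR>"}

class ToyASRModel:
    """Stand-in for a trained decoder that emits a dialect token first."""
    def generate(self, audio_features, decoder_prefix):
        if decoder_prefix:
            # Forced mode: the caller has already fixed the dialect token.
            dialect_token = decoder_prefix[0]
        else:
            # Inferred mode: the model predicts the dialect token itself
            # (a real model would score all candidate tokens; this toy
            # version just returns a fixed one).
            dialect_token = "<EGY>"
        return [dialect_token, "transcribed", "words"]

def decode_with_dialect(model, audio_features, dialect=None):
    """Decode with a forced dialect label, or let the model infer one."""
    prefix = [DIALECT_TOKENS[dialect]] if dialect else []
    return model.generate(audio_features, decoder_prefix=prefix)
```

The design choice the researchers compared is exactly this fork: whether the first decoded token is supplied by the caller (and may be wrong if the data label is wrong) or predicted by the model itself.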

Results showed that fine-tuning on many dialects improved performance on low-resource dialects but hurt it on high-resource dialects.

Though using dialect ID was effective, the model performed better when it inferred the dialect rather than being told it. Toyin says this could be due to a data-labeling issue. “If you tell the model the wrong dialect, it will predict the wrong thing,” she says. If it is left to infer the dialect, “it can correct itself and go in the direction it thinks best.”

Finally, the researchers tested the code-switching abilities of the model. They found that multilingual pretraining improved performance across all code-switching tests, producing absolute reductions in word error rate of more than 10%. That said, there was a small decrease in capability on the MSA dataset, and language interference was significant on the dialects, with word error rates ranging from 4% to 16%.

Next steps for dialectal automatic speech recognition

Though the results show how dialectal pretraining can be used to improve the performance of ASR models, Djanibekov and Toyin say that there is still work to be done to improve these systems.

A big part relates to building new and better datasets focused specifically on the wide range of Arabic dialects. Toyin thinks that standardization of dialectal data would help researchers, and she hopes that as researchers build new datasets, they will do so in a way that provides accurate labels. She is currently working on a project to develop resources for Emirati and other dialects.

For Djanibekov, the project opened his eyes to the great variability of dialects and the implications this has for ASR, not only for Arabic but for other languages as well.
