
What it takes to teach a machine to see in Arabic

Wednesday, April 22, 2026

Arabic is spoken by more than 400 million people across two dozen countries. It is the language of one of the world’s oldest literary traditions and the working language of major global economies. 

By almost any measure, it is one of the most important languages on Earth. And yet, when it comes to frontier AI models, Arabic has been largely left behind.

The large multimodal models that now dominate AI research (systems that can read documents, interpret medical scans, analyze satellite imagery, and answer questions about photographs) have been built overwhelmingly for English and, to a lesser extent, Chinese. Arabic-language AI has made progress on the text-only front, but models that combine vision and language for Arabic remain scarce, and the few that exist tend to cover narrow slices of the language’s rich complexity.

A team of researchers, led by M.Sc. student Ahmed Heakl and Ph.D. student Sara Ghaboura, both from MBZUAI, is attempting to change that. Their model, called AIN (the Arabic INclusive Large Multimodal Model, and also the Arabic word for “eye”), is a 7B-parameter system that can process images and text together in both Arabic and English. In a recent technical report, the team claims AIN outperforms OpenAI’s GPT-4o, a model orders of magnitude larger, by an average of 3.4 percentage points across 38 Arabic-language subtasks spanning eight domains. Those domains range from visual question answering and OCR to medical imaging, agricultural disease detection, and satellite-based land-use classification.

Beyond the headline result, the AIN project is worth understanding in detail for what the team had to build to get there: a multi-layered pipeline for creating high-quality Arabic multimodal training data from the ground up.

The data problem

Building a multimodal model for English is, comparatively speaking, a data-rich proposition. Millions of image-text pairs exist across the open web, in academic datasets, and in the outputs of prior research projects. For Arabic, no such abundance exists.

The AIN team assembled a dataset of 3.6 million multimodal samples, mixing Arabic and English. About 35% of the Arabic data was what the researchers call “authentic,” meaning it was natively Arabic rather than translated. The rest was produced by translating English-language datasets into Modern Standard Arabic (MSA), and the choices the team made in managing that translation process constitute one of the more interesting contributions of the project.

They began by evaluating three models from OpenAI’s GPT-4 family as translation engines: GPT-4, GPT-4o, and GPT-4o-mini. Native Arabic speakers rated the outputs against human reference translations. The winner was GPT-4o-mini. Evaluators noted that the smaller model translated more consistently and handled proper nouns (brand names like “Boeing,” for instance) more reliably than its larger sibling GPT-4o, which sometimes dropped or garbled them.

But translation alone was not enough; the team designed a verification pipeline with multiple stages. First, they used LaBSE, a language-agnostic sentence embedding model, to compute semantic similarity between each English source sentence and its Arabic translation. They chose LaBSE over other multilingual embedding models after a head-to-head evaluation in which they tested five models on a handcrafted set of Arabic sentences designed to probe specific failure modes: punctuation misalignment between English and Arabic, confusion of grammatical gender (Arabic distinguishes masculine and feminine forms in ways English does not), handling of diacritical marks, and the difference between literal and meaning-preserving translation. LaBSE proved the most reliable at assigning high scores to good translations and low scores to bad ones. Translations falling below 80% similarity were discarded, accounting for less than 2% of the data.
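The similarity-threshold step can be sketched as follows. This is a minimal illustration, not the team's actual code: the `embed` function stands in for a LaBSE encoder (in practice the embeddings would come from the LaBSE model, e.g. via a sentence-embedding library), and the 0.80 cutoff mirrors the 80% threshold described above.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_translations(pairs, embed, threshold=0.80):
    """Keep (source, translation) pairs whose embedding similarity
    meets the threshold; return kept and discarded lists.

    `embed` maps a sentence to its vector (hypothetical stand-in
    for a LaBSE encoder)."""
    kept, discarded = [], []
    for src, tgt in pairs:
        sim = cosine_similarity(embed(src), embed(tgt))
        (kept if sim >= threshold else discarded).append((src, tgt, sim))
    return kept, discarded
```

Because LaBSE embeds all languages into one vector space, a faithful Arabic translation lands close to its English source, so a single cosine threshold can screen translations without any reference text.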

A second verification step reverse-translated the Arabic back into English using GPT-4o-mini and compared the round-tripped result to the original using BLEU, METEOR, and ROUGE. The scores were strong: 86% on METEOR, suggesting the translations preserved meaning well, and above 85% on ROUGE-L, indicating structural fidelity.
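To make the round-trip check concrete, here is a minimal ROUGE-L score (the longest-common-subsequence F-measure mentioned above), written from scratch for illustration. It uses simple whitespace tokenization; production metric libraries handle stemming, casing, and multi-reference scoring.

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference sentence (here, the original
    English) and a candidate (the reverse-translated English)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A round-tripped sentence that preserves the original's word order scores near 1.0, which is why a high ROUGE-L average is read as evidence of structural fidelity rather than just shared vocabulary.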

Finally, all images in the dataset were screened for toxicity using LLaVA-Guard safety policies in combination with GPT-4o. About 4.4% of images were flagged and removed, distributed across categories including weapons, harassment, animal cruelty, and violence.

The whole pipeline shows that for languages where high-quality multimodal data does not already exist in bulk, careful curation and rigorous verification can substitute for sheer volume. Whether 3.6 million samples is enough to fully capture a language as morphologically and dialectally complex as Arabic remains an open question, but the verification methodology itself is a template other teams could adapt.

Architecture and training

The base model under AIN is Qwen2-VL-7B, a vision-language model originally developed by Alibaba’s Qwen team. The MBZUAI team performed full-parameter fine-tuning on their bilingual dataset, using 64 NVIDIA A100 GPUs distributed across eight nodes. They used flash attention and Liger kernels to reduce memory overhead and followed hyperparameter configurations from LLaMA-Factory, an open-source fine-tuning toolkit. The training setup is straightforward and reproducible, underscoring that AIN is the product of careful data work applied on top of a capable open-source foundation.

Images that users submit to a deployed model rarely arrive clean. Files circulated on the internet are typically JPEG-compressed, and each upload-download cycle compounds the loss. In contrast, training data tends to be curated and artifact-free, creating a distribution mismatch that hurts deployment performance.

To close that gap, the team applied online lossy compression augmentation during training: 25% of images were randomly subjected to multi-round JPEG compression before being fed to the model, simulating the degraded quality found in real-world inputs.
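A minimal sketch of this augmentation using Pillow. The 25% probability comes from the report; the quality range and the maximum number of compression rounds below are illustrative assumptions, since the paper's exact settings are not given here.

```python
import io
import random

from PIL import Image

def jpeg_round(img: Image.Image, quality: int) -> Image.Image:
    # One lossy JPEG encode/decode cycle through an in-memory buffer.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def compression_augment(img: Image.Image, p=0.25, max_rounds=3, rng=random):
    """With probability p, apply several JPEG compression cycles at
    random quality levels (quality range and max_rounds are assumed,
    not taken from the AIN report)."""
    if rng.random() >= p:
        return img  # leave 75% of images untouched
    for _ in range(rng.randint(1, max_rounds)):
        img = jpeg_round(img, quality=rng.randint(30, 75))
    return img
```

Each encode/decode cycle compounds the blocky artifacts of the previous one, which is what makes multi-round compression a closer match to images that have been re-shared across the web than a single lossy save.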

How it performs

On CAMEL-Bench, a comprehensive Arabic multimodal benchmark with 38 sub-domains, AIN-7B achieved an aggregate score of 63.77%, compared to 60.13% for GPT-4o and 52.38% for Gemini 1.5 Pro. Its strongest showing was in OCR and document understanding, where it scored 72.35%, far ahead of GPT-4o’s 54.98%. It also outperformed GPT-4o in visual question answering, remote sensing, and chart, diagram, and table understanding.

On ArabicMMLU, a text-based Arabic benchmark spanning 19 academic subjects from accounting to physics, AIN improved over its base model Qwen2-VL-7B in 14 out of 19 categories, with a 3-point overall gain. Notably, it did not sacrifice English capability in the process. Across 10 English-language vision benchmarks, AIN improved over Qwen2-VL-7B on every single one, with gains as large as 12 points on MMBench and nearly 6 points on ScienceQA.

The researchers also conducted a human evaluation with more than 200 Arabic-speaking participants from 17 countries. In a blind comparison, where participants did not know which model produced which answer, 76% preferred AIN’s responses, compared to 15% for GPT-4o and 9% for LLaVA. The evaluation spanned domains including food recognition, medical diagnosis, road sign identification, and chart interpretation. In several cases, AIN caught details that human participants missed, such as counting specific structures in satellite imagery or correctly identifying the shape of a food item as a disc rather than a circle.

The dialect question

One of the more revealing findings in the AIN report has little to do with model performance. As part of their human evaluation survey, the researchers asked participants whether they found Modern Standard Arabic suitable for the task, or whether they would have preferred their local dialect.

Nearly three-quarters of respondents said MSA was clear and they preferred it for reading and writing. Another 11% were comfortable with MSA but noted a preference for their own dialect. Only about 4% found MSA genuinely difficult and preferred dialect instead.

This result touches on a long-running tension in Arabic natural language processing. Arabic is not one language but a family of related varieties: MSA, used in formal writing and media, and dozens of regional dialects that differ substantially in vocabulary, grammar, and phonology. A person from Morocco and a person from Iraq may struggle to understand each other’s spoken Arabic, even when both read MSA fluently.

Most Arabic AI work to date has focused on MSA, and AIN follows that convention. The survey data offers some empirical support for that choice, at least in formal and professional contexts. But it also hints at a frontier that MBZUAI has spent the past few years working on: models that can navigate not just the formality of MSA but the lived reality of dialectal Arabic, where most everyday conversation actually happens.

Why AIN matters

The global AI landscape still has a language problem. The models that receive the most investment, the most training data, and the most research attention are those that serve English speakers. But the vast majority of the world’s roughly 7,000 languages remain poorly served or entirely unserved by multimodal AI.

Arabic occupies an unusual position in this landscape. It is neither a low-resource language in the traditional sense (substantial Arabic text data exists online) nor a well-resourced one when it comes to the paired image-and-text data that multimodal models require. AIN’s contribution is to show that a relatively modest investment in data curation, translation infrastructure, and quality control, applied on top of an open-source foundation model, can produce results that compete with or exceed the best proprietary systems on Arabic-specific tasks.

That finding has implications beyond Arabic. If the same methodology can be adapted for Urdu, Bengali, Swahili, or any of the dozens of other languages spoken by hundreds of millions of people, the path to genuinely multilingual multimodal AI may run not through ever-larger English-centric models but through careful, language-specific data work. AIN is one model for one language, but the template it offers could prove more valuable than the model itself.

AIN is also not an isolated project but part of a broader, deliberate effort by MBZUAI to build the infrastructure that Arabic-language AI has been missing. Alongside the model itself, teams led by Dr. Rao Anwer and Professor Salman Khan have released a constellation of benchmarks designed to pressure-test multimodal systems on tasks that are specifically challenging for Arabic. CAMEL-Bench, the 38-sub-domain evaluation suite used in the AIN paper and accepted at NAACL 2025, was only the beginning.

KITAB-Bench, accepted at ACL 2025, targets Arabic OCR and document understanding across nine domains and 36 sub-domains, probing weaknesses in cursive script recognition, right-to-left text flow, and complex calligraphic features that general-purpose OCR systems tend to fumble. ARB, the Arabic Reasoning Benchmark, goes deeper still: it evaluates step-by-step multimodal reasoning in Arabic across 11 domains with over 5,000 curated reasoning steps, exposing how even frontier models like GPT-4o produce logically inconsistent chains when forced to reason in Arabic rather than English. And DuwatBench, accepted at EACL 2026, tackles what may be the most culturally specific challenge of all: Arabic calligraphy recognition, where AI systems must interpret six classical and modern script styles ranging from the geometric angularity of Kufic to the ornate ligatures of Diwani.

The research community has taken notice. Within a year of its release, AIN has accumulated nearly 1 million downloads on Hugging Face, making it both the fastest-growing and the most downloaded model among the platform’s top 10. The benchmarks are gaining traction as well, particularly given their Arabic and cultural focus: ARB has reached 2,830 downloads, CAMEL-Bench passed the 30,000 mark, KITAB-Bench was downloaded 27,000 times, TimeTravel (DuwatBench’s dataset) has reached 2,630, and the recently released DuwatBench crossed 1,140 downloads within its first three months.

Taken together, the picture is of a research group that understood from the start that a single model, however capable, would not be enough. Building AI for an underserved language requires building the entire evaluation ecosystem around it: the benchmarks that define what “good” means, the datasets that expose failure modes, and the cultural and linguistic specificity that off-the-shelf translations cannot provide.

 
