In the past several years, scientists in the field of artificial intelligence have made remarkable progress, particularly with multimodal large language models (LLMs) that are designed to process images and text in different languages. Yet while these systems tend to perform well on a handful of languages, such as Chinese and English, their performance suffers on most of the nearly 7,000 languages spoken across the world.
In the effort to improve multimodal LLMs on a wide array of languages, scientists from MBZUAI and other institutions have developed a new benchmark dataset for evaluating their performance. The benchmark focuses on what are known as cultural visual question-answer tasks, which require models to make sense of both textual and visual information and the cultural context and meaning they convey.
Named the All Languages Matter Benchmark, or ALM Bench, the dataset evaluates multimodal LLMs on 100 languages. Many are low-resource languages – languages for which there are few digital resources to train models. The authors of the study describe ALM Bench as the largest and most comprehensive effort for measuring the performance of multimodal LLMs.
In addition to developing ALM Bench, the team tested 16 open- and closed-source multimodal LLMs on it. The results revealed a significant need for greater cultural and linguistic inclusivity in these systems, particularly for low-resource languages.
“There has been a lot of work to evaluate the performance of models on a handful of high-resource languages, but there are many languages that these models had never been tested on,” says Fahad Khan, deputy department chair of computer vision and professor of computer vision at MBZUAI, and co-author of the study. “We wanted to know how inclusive these models are in terms of visual reasoning.”
Khan and his co-authors note that for multimodal LLMs to truly serve people across the globe, developers must improve them so that they exhibit capabilities beyond linguistic fluency and gain an understanding of the many nuances of language and culture that shape human experience.
The ALM Bench dataset includes more than 22,000 question-answer pairs across 19 categories such as sports, literature, media and food. Languages range from high-resource, like Spanish and French, to underrepresented, such as Cebuano and Kyrgyz. The researchers also included three Arabic dialects: Egyptian, Emirati and Saudi.
The benchmark’s cultural breadth and depth come from images and questions designed to reflect local traditions, values and practices. The researchers collected culturally relevant images from the Internet that relate to the languages covered in the dataset, then prompted OpenAI’s GPT-4o to generate questions in the target languages based on the images. Speakers of the languages reviewed the questions for accuracy. Sixty annotators participated in the project, and 80% were native speakers.
“This was a broad effort and we wanted to create a benchmark where we can analyze and evaluate different vision language models, both open-source as well as closed-source, to understand where the gaps are, which languages are hard for these models and which cultural aspects they have difficulty with,” says Salman Khan, associate professor of computer vision at MBZUAI and co-author of the study.
Scientists have built visual question-answer datasets in the past, but ALM Bench includes more languages than previous efforts. Other benchmarks are typically made up of multiple-choice questions only. ALM Bench has multiple-choice, true-false and open-ended questions, which allows for a more comprehensive test of multimodal LLMs.
The study’s findings highlight a significant performance gap between open-source and closed-source models. Proprietary systems, like GPT-4o, consistently outperformed their open-source counterparts, achieving higher accuracy across both high- and low-resource languages. GPT-4o achieved an overall accuracy of 78.8%, while the best open-source model, GLM-4V-9B, developed by scientists from Tsinghua University and Zhipu AI, achieved an overall accuracy of 51.8%.
Even the best-performing models, however, struggled with low-resource languages, particularly those spoken in Africa and South Asia. For example, while GPT-4o received an accuracy score of 88.4% on English, it only reached 50.8% on Amharic, which is spoken in Ethiopia.
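To put these figures side by side, the gaps can be expressed in percentage points. The short Python sketch below uses only the accuracy scores reported in this article; the variable and function names are illustrative and do not come from the study.

```python
# Accuracy figures (percent) as reported in this article.
overall = {"GPT-4o": 78.8, "GLM-4V-9B": 51.8}
gpt4o_by_language = {"English": 88.4, "Amharic": 50.8}


def gap_in_points(higher, lower):
    """Return the difference between two accuracy scores in percentage points."""
    return round(higher - lower, 1)


# Gap between the best closed-source and best open-source model overall.
closed_vs_open = gap_in_points(overall["GPT-4o"], overall["GLM-4V-9B"])

# Gap between a high-resource and a low-resource language for GPT-4o.
english_vs_amharic = gap_in_points(
    gpt4o_by_language["English"], gpt4o_by_language["Amharic"]
)

print(f"Closed vs. open source gap: {closed_vs_open} points")
print(f"English vs. Amharic gap (GPT-4o): {english_vs_amharic} points")
```

By this arithmetic, the best closed-source model leads the best open-source model by 27.0 points overall, and GPT-4o's own gap between English and Amharic is 37.6 points.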
Salman Khan explains that, overall, the difference in performance on ALM Bench between the open- and closed-source models was surprisingly large. Other benchmarks test abilities such as image understanding and complex reasoning based on the interpretation of charts, tables and infographics. On these tasks, he says, there isn’t such a stark gap in performance between the open and closed models. In fact, in some specific cases, the open models perform better than the closed models. This is not the case on this cultural visual question-answer benchmark.
Though it’s not possible to know for sure why the closed-source systems performed better, Salman Khan explains that it’s most likely because those models were exposed to more data related to low-resource languages during training.
The researchers also found that including images in their queries significantly improved model accuracy. Without images, GPT-4o’s performance dropped by 27% across languages.
Generative AI systems are increasingly relied upon as digital assistants, creative collaborators and sources of information. But for these systems to be universally useful, they must respect the cultural and linguistic contexts of their users. “It’s necessary that these systems understand local cultures, the norms of societies, the seminal figures and the traditions and practices people in different places follow,” Salman Khan says. “All of these need to be inherently embedded in any generative AI model.”
The performance of the models on ALM Bench illuminates their weaknesses not only on specific languages but on entire language families, such as Atlantic-Congo and Turkic languages. “With this effort, we wanted to understand which language families are relatively less represented in the current datasets related to vision language understanding as well so that we can hopefully bridge that gap,” Fahad Khan says.
The development of ALM Bench highlights the need for more diverse and representative training data in more of the world’s thousands of languages. Current datasets are heavily skewed toward high-resource languages and Western cultures, limiting their applicability to communities across the world. Addressing these gaps will require sustained efforts from researchers, industry, and, of course, the language communities themselves.