Vision language models (VLMs) are systems designed to process both visual data, such as images and videos, and natural language, allowing users to engage them in tasks that span both domains. For example, with a simple prompt, a user can instruct a VLM to generate an image of a dragon. Alternatively, a user could upload an image of a dragon and ask the model to describe it, a process known as captioning. There are, of course, many other potential uses of VLMs beyond these basic examples.
Over the past few years, developers have released increasingly capable VLMs, including OpenAI’s GPT-4V and Google’s Gemini. At the same time, researchers are working to evaluate these systems and to understand what they can and cannot do.
A recent study by researchers at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) introduced a new benchmark dataset for evaluating a VLM’s ability to reason over combined textual and visual information. The research was presented at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) in Bangkok, one of the most important conferences in the field of natural language processing. It was one of 41 studies MBZUAI scientists shared at the conference.
Benchmark datasets are critical for evaluating the capabilities of artificial intelligence systems. Researchers who develop benchmarks for large language models (LLMs), including some at MBZUAI, have recently turned to compiling datasets composed of school exams. Just as they do for people, exam questions make good tests for models because they assess broad capabilities such as language understanding, real-world knowledge, and the ability to reason. While several exam-based datasets have been created to evaluate the performance of LLMs, there are no comparable datasets designed specifically for VLMs.
“There is no other dataset like this,” said Preslav Nakov, a coauthor of the study and professor and department chair of natural language processing at MBZUAI. “In other datasets, images and text may be separate, but with this dataset, we provide the whole question in an image to the model.”
The dataset, called EXAMS-V, is extremely broad in scope. It is a “multimodal extension” of an earlier dataset, EXAMS, which was designed for LLMs and aggregated from standardized school exams. EXAMS-V includes more than 20,000 multiple-choice questions across 26 subjects, such as physics, chemistry, history, geography, and math, and 11 languages, including Arabic, Bulgarian, Chinese, Croatian, English, German, and Russian. Rocktim Jyoti Das and Haonan Li of MBZUAI are coauthors on the study.
“When we spoke with people who are building vision language models, they told us that this kind of dataset would be the real test of their capabilities,” Nakov said.
EXAMS-V includes tables, figures, graphs, symbols, and text. The models are asked to analyze all of this information in context and choose the correct answer to each question.
The researchers presented questions to the VLMs by giving each model an image containing the full question, along with instructions to answer it. “To answer the question correctly, the model must go beyond simply understanding what is in the image,” Nakov said.
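To make the setup concrete, here is a minimal sketch of what such an evaluation call might look like, using the OpenAI Python SDK. The model name, prompt wording, and file path are illustrative assumptions, not the paper’s exact configuration.

```python
# Sketch: send an exam question image to a vision-capable model and ask
# for the letter of the correct option. Names and prompts are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_vlm(image_path: str) -> str:
    """Send the exam question (as an image) plus an answer instruction."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for a vision-capable model such as GPT-4V
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Answer the multiple-choice question shown in the "
                             "image. Reply with the letter of the correct option."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
    )
    return response.choices[0].message.content

print(ask_vlm("exams_v_question.png"))  # hypothetical file name
```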
In addition to the VLM evaluation, the researchers tested a group of LLMs, which don’t have the capacity to analyze images, so the questions needed to be converted into text. For the LLM group, the researchers used optical character recognition (OCR) to extract the text from the images and a VLM to generate captions describing the images. The extracted OCR text and the captions were then fed into the LLMs with instructions to choose the correct answer.
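A rough sketch of this text-only pipeline follows: OCR the question image, caption it with a VLM, then hand both to an LLM. The library choices (pytesseract for OCR, the OpenAI SDK for the models) and the prompts are assumptions for illustration, not the authors’ exact tooling.

```python
# Sketch of the OCR + caption pipeline for text-only LLMs (assumed tooling).
import base64
import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()

def ocr_text(image_path: str) -> str:
    """Extract the printed question text from the image."""
    return pytesseract.image_to_string(Image.open(image_path))

def caption_image(image_path: str) -> str:
    """Ask a VLM to describe the figures, tables, and symbols in the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of captioning VLM
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Briefly describe any figures, tables, or symbols in this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def ask_llm(image_path: str) -> str:
    """Feed OCR text plus the generated caption to a text-only LLM."""
    prompt = (
        "Question text (OCR):\n" + ocr_text(image_path) + "\n\n"
        "Image description:\n" + caption_image(image_path) + "\n\n"
        "Choose the correct option and reply with its letter."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # used here as a text-only model; name is illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```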
The performance of the models varied significantly across languages and subjects. GPT-4V attained 62% accuracy on the Italian questions but as low as 22% on the Chinese questions. Gemini’s best performance was in German (48%) and its worst was in Arabic (19%). Since each question has four answer options, random-guess performance is approximately 25%. Model performance also appeared to correlate with the difficulty of the questions, which depends in part on the kinds of figures (tables, charts, graphs) they contain.
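The arithmetic behind these figures is straightforward: accuracy is the fraction of questions answered correctly within each language, and with four options random guessing lands at about 25%. A toy illustration (the records below are made up, not from the study):

```python
# Per-language accuracy over multiple-choice predictions vs. a 1-in-4 baseline.
from collections import defaultdict

records = [  # hypothetical example predictions
    {"language": "Italian", "predicted": "B", "gold": "B"},
    {"language": "Italian", "predicted": "A", "gold": "C"},
    {"language": "Chinese", "predicted": "D", "gold": "D"},
]

correct, total = defaultdict(int), defaultdict(int)
for r in records:
    total[r["language"]] += 1
    correct[r["language"]] += int(r["predicted"] == r["gold"])

for lang in total:
    print(f"{lang}: {correct[lang] / total[lang]:.0%} accuracy")

print(f"Chance baseline with four options: {1 / 4:.0%}")
```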
In some cases, the LLMs that were provided the OCR text and captions performed better than the VLMs. Though the data comparing the performance of the VLMs and the LLMs isn’t conclusive, Nakov believes it suggests that “the VLMs aren’t such strong reasoners. They are trained to understand what is in an image, but they aren’t good at thinking about it.”
The researchers were also able to conduct what is called a parallel data examination, meaning they had the same questions in more than one language, in this case Croatian, Italian, and Serbian. In theory, the VLMs should perform equally well across these three languages. However, they did not.
Croatian and Serbian are mutually intelligible languages, albeit written in different scripts: Croatian uses the Latin alphabet and Serbian the Cyrillic alphabet. Yet the models performed much better on Croatian than on Serbian. “It seems that if you just give the model the Serbian question, which is written in the Cyrillic alphabet, the model doesn’t understand the alphabet as well,” Nakov said.
The LLMs, however, performed better than the VLMs on the Serbian dataset, perhaps because the OCR-extracted text is easier for them to process.
Nakov noted that there are significant efforts to build models with multimodal and multilingual capabilities and that it is important to develop benchmarks that will be able to accurately evaluate them.
And in the future, the need for datasets that test a wide range of capabilities will become even more important. “If you want models that are capable of solving real-world tasks, not artificial ones, these models will require a lot of knowledge across many different disciplines,” he said. “If you want to see if a model is truly intelligent, one way is to give it many different tasks that test many different capabilities.”