At a fundamental level, humans and artificial intelligence applications face a similar dilemma. No matter our preparation, or the knowledge and experience we have amassed, both humans and machines will confront unforeseen circumstances. Managing the contingency and messiness of the world is one of the great challenges developers face when building artificial intelligence applications.
Not all AI applications need the capability to process new and unexpected information. However, those designed for open-world scenarios must manage what is known as out-of-distribution data — new information not used in training the application.
“In open world scenarios, the agent not only needs to generalize to out-of-distribution data but is also required to continuously gain more experience from these data,” said Sheng Zhang, a research assistant at MBZUAI and master’s degree recipient from the university. “We can’t guarantee that all possible observations or combinations an agent experiences have been encountered during the training stage, and there will likely be novel data that appears in the environment.”
This capability to manage new scenarios is essential, since it’s impractical to anticipate and collect training data for all possible circumstances that an application may encounter in the real world.
For example, in autonomous driving, a model that is trained in one geographic location might struggle to generalize to environments with different road signs, weather conditions or driving conventions. Similarly, in medical imaging, a model trained on data from one population might not perform well on data from a different population with distinct demographic or clinical characteristics.
Zhang and scientists at MBZUAI have authored a study that proposes both a new problem formulation and a solution to improve the openness and generalizability of vision-language models. The work will be delivered in an oral presentation at the 38th Annual AAAI Conference on Artificial Intelligence in Vancouver, Canada.
Additional authors of the study are Professor of Computer Vision Fahad Khan, Professor of Machine Learning Kun Zhang, Associate Professor of Computer Vision Salman Khan, Assistant Professor of Machine Learning Zhiqiang Shen, Postdoctoral Research Fellow Guangyi Chen and Research Scientist Muzammal Naseer, all of MBZUAI.
Vision-language models are AI applications that combine natural language processing and computer vision capabilities, bridging the gap between linguistic and visual domains. These models create associations between words (for example, “dog”) and related images (pictures of dogs).
This is a significant challenge for AI as it requires a comprehensive understanding of both human language and the visual world, and how representations in these different domains relate to each other.
Existing vision-language models follow two learning paradigms: generative and contrastive. Generative vision-language models can be used to create descriptive captions for images, answer questions about the content of photographs or even produce images based on textual descriptions. Contrastive vision-language models align images and textual descriptions in a joint embedding space.
An example of a contrastive vision-language model is CLIP (contrastive language–image pre-training), developed by OpenAI. After pretraining, CLIP can categorize images in a zero-shot manner, that is, by aligning images to a pre-defined vocabulary in the shared representation space, even if some categories were not seen during training.
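To make the idea concrete, here is a minimal sketch of CLIP-style zero-shot classification using the Hugging Face transformers library. It is an illustration rather than code from the study; the image file and the candidate labels are placeholders, and the hand-written label list is exactly the kind of pre-defined vocabulary the researchers go on to question.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP model and its paired processor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The vocabulary is fixed ahead of time: a closed-world assumption.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("example.jpg")  # placeholder path

# Embed the image and the label texts, then compare them in the shared space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, turned into a probability over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

The model can only ever answer with one of the labels it is handed, which is why the choice of vocabulary matters so much.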
Zhang, however, believes that the reported high performance of CLIP in zero-shot learning scenarios is too optimistic, partly because of a widely adopted but unrealistic assumption about the vocabulary.
“They still use a closed-world assumption in that, in the given dataset, the ground-truth label sets and the number of classes are already assumed,” Zhang said. “In reality, when we collect a massive image dataset, we can’t assume that we already know all the class names of objects in the images before exhaustive annotation. Therefore, such an ideal vocabulary does not exist.”
There are other cases like this as well. “Many popular settings of vision-language learning are not realistic enough because they rely on closed-world assumptions like this. These include zero-shot learning, prompt tuning and open-vocabulary learning. This problem hasn’t been emphasized enough by the community, and we wanted to point it out,” he said.
The researchers frame the problem they seek to address as realistic zero-shot classification, meaning zero-shot classification under a more realistic and relaxed assumption about the vocabulary available to vision-language models. Their goal is to give vision-language models zero-shot recognition ability in open-world scenarios, so they can identify categories for images without the need for annotations made by a human.
They call their proposed solution self-structural semantic alignment (S3A), and it is designed to be more practical and adaptable to real-world scenarios than current methods. The framework iteratively refines the semantic relationships between images and a vast vocabulary that extends beyond traditional closed-world assumptions, improving accuracy in identifying unseen categories without human-annotated labels.
The problem, however, becomes significantly more challenging when scaling up to a realistic vocabulary. “We’re focused on identifying intrinsic semantic information from the graph structures in the image embedding space by forming clusters, which then reveal shared semantic meanings,” Zhang said. “By mapping these clusters against a broad vocabulary, we pinpoint the precise category for each group.”
The process involves grouping images based on their embeddings, associating these groups with potential category names, using large language models to refine these associations and then realigning the categories with the images based on this updated understanding.
“Our strategy involves utilizing the extensive capabilities of large language models to inject additional visual context for each category name,” Zhang said. “The generated information is aimed at distinguishing between closely related images, prioritizing the unique characteristics of similar objects.”
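As a rough illustration of this cluster-then-name loop, and not the authors’ S3A implementation, the sketch below clusters image embeddings with k-means and assigns each cluster the closest name from a large vocabulary by cosine similarity. The function, its arguments and the use of k-means are assumptions made for clarity, and the language-model refinement step is only indicated in a comment.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_name(image_embs, vocab, text_embs, n_clusters=10):
    """Group image embeddings and give each group a name from a large vocabulary.

    image_embs: (N, D) L2-normalized image embeddings (e.g., from an image encoder)
    vocab:      list of V candidate category names (an open, realistic vocabulary)
    text_embs:  (V, D) L2-normalized text embeddings of those candidate names
    """
    # Step 1: uncover structure in the image embedding space by clustering.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(image_embs)
    centroids = km.cluster_centers_
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)

    # Step 2: match each cluster centroid to the nearest vocabulary entry
    # by cosine similarity in the shared image-text embedding space.
    sims = centroids @ text_embs.T                # shape (n_clusters, V)
    names = [vocab[i] for i in sims.argmax(axis=1)]

    # Step 3 (not shown): prompt a large language model to enrich each candidate
    # name with discriminative visual descriptions, re-embed the enriched text,
    # and repeat the matching until cluster-to-name assignments stabilize.
    return km.labels_, names
```

In this simplified version each image simply inherits the name assigned to its cluster; the study describes an iterative refinement of both the groupings and the candidate names rather than a single pass.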
This method allows for a significant improvement in the ability to classify images into categories that were not seen during the training of the model, overcoming the limitations of previous approaches.
In the study, the team’s S3A framework outperformed state-of-the-art models across various experiments, demonstrating its effectiveness not just in recognizing a wide range of categories but also in handling categories that fall outside its extensive vocabulary.
Zhang emphasized the need for more stringent and realistic evaluations within the community. “Our objective is to highlight the inadequacies of current assessments and underscore the importance of developing foundational models capable of serving as intelligent aids in real life,” he said. “Addressing the challenge of generalizing to open-world scenarios remains unresolved, necessitating collective efforts to equip intelligent systems with this capability in the long run.”