
Fine-grained species recognition with MAviS: a new dataset, benchmark, and model

Thursday, November 27, 2025

There are nearly 11,000 species of birds, and they come in all shapes and sizes. Most fly, many swim, and a few are bound to land. And unlike many other types of animals, they can be found nearly everywhere, from the frigid shores of Antarctica to the deserts of the Arabian Peninsula. This great diversity is partly what fascinates those who observe them. But differences between species can be extremely difficult to detect, even for experts.

Multimodal models hold the potential to help scientists, environmental professionals, and even the casual birder accurately identify species. But today’s models struggle to do this across the wide variety of birds, and they perform even worse on rare and regional varieties because they aren’t trained on the subtle physical and acoustic features that mark the boundaries between species. Models also tend to make predictions based on how frequently species appear in their training data, biasing them toward more common types.

To address this limitation, researchers at MBZUAI have taken a step towards improving the ability of multimodal models to detect bird species. They developed a new training dataset, a benchmark dataset, and a new multimodal chatbot that can interpret images, audio, and text.

Yevheniia Kryklyvets, a graduate of the master’s program in computer vision at MBZUAI, led the development as part of her master’s thesis under the supervision of Assistant Professor of Computer Vision Hisham Cholakkal. Kryklyvets says that their innovation could be especially valuable for environmental agencies and organizations involved in monitoring avian habitats.

A study about the work was recently shared in an oral presentation at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025) in Suzhou, China. It was selected as a “Senior Area Chair Highlight,” a prestigious recognition awarded to the best papers at the conference.

In addition to Kryklyvets and Cholakkal, the authors of the study are Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Salman Khan.

Building a new dataset

Identifying the correct bird species from an image or a recording is an example of a fine-grained classification problem in machine learning – a challenge for multimodal models because they are typically trained on broad datasets that lack domain-specific detail.

Advancing biodiversity conservation and supporting ecological monitoring often requires going beyond conventional fine-grained classification to species-specific multimodal question answering, covering diverse recognition and reasoning question types that relate to visual attributes, audio characteristics, habitat, feeding habits, and more.

To address this challenge, the researchers built a collection of images, audio, and text called the MAviS-Dataset, which they describe as the first large-scale multimodal resource dedicated to fine-grained avian species.

It covers more than 1,000 bird species across all major avian families and geographical regions, containing approximately 420,000 images and 115,000 audio clips. On average, each species is associated with roughly 420 images and 115 audio recordings, although the numbers vary by species. It includes two subsets: a pretraining dataset and a fine-tuning dataset.

To create the MAviS-Dataset, the researchers combined several open-source resources, for example, the Tree of Life collection used to train BioCLIP, with other data, including 3,000 audio recordings of rare birds curated by the Cornell Lab of Ornithology. The researchers say this approach ensures the dataset is well-rounded, encompassing data from both experts and the public.

The researchers supplemented each species with text that describes how it behaves, what it looks and sounds like, and where it lives.

They built an automatic annotation pipeline using open-source multimodal models to “enrich” the fine-tuning subset. Fine-tuning examples are paired with several question–answer pairs that relate to traits like appearance, sound, and habitat.
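
As a rough illustration of one step in such a pipeline, the sketch below prompts an open-source instruction-following model to turn a species description into question–answer pairs. The model choice, prompt wording, and helper function are assumptions made for this example, not the authors’ actual pipeline.

```python
# Illustrative sketch of automatic QA-pair generation from a species
# description. The model, prompt, and helper are hypothetical stand-ins.
from transformers import pipeline

# Any open-source instruction-following model could fill this role.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def make_qa_pairs(species: str, description: str, n_pairs: int = 3) -> str:
    """Prompt the model to turn a species description into QA pairs."""
    prompt = (
        f"Here is a description of the bird species '{species}':\n"
        f"{description}\n\n"
        f"Write {n_pairs} question-answer pairs about its appearance, "
        "sound, and habitat, one per line in the form 'Q: ... A: ...'."
    )
    out = generator(prompt, max_new_tokens=256, do_sample=False,
                    return_full_text=False)
    return out[0]["generated_text"]

print(make_qa_pairs(
    "White-eared Bulbul",
    "A small songbird with a dark head, a white cheek patch, and a "
    "bubbling call, common in gardens across the Arabian Peninsula.",
))
```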

“Even the most powerful multimodal models don’t perform well on fine-grained understanding problems, which is why it was important to design the instruction tuning dataset,” Cholakkal explains.

A new multimodal model for avian species detection

The team developed a multimodal chatbot called MAviS-Chat built on the MiniCPM-o-2.6 architecture. MAviS-Chat combines a vision encoder, an audio encoder, and an open-source language model.
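
At a high level, this kind of architecture maps each modality into the language model’s embedding space before decoding. The toy PyTorch sketch below illustrates that wiring; the dimensions, projection layers, and class name are illustrative assumptions, not MAviS-Chat’s actual implementation.

```python
# Toy sketch of a three-part multimodal architecture: vision and audio
# features are projected into a language model's embedding space and
# concatenated with the text embeddings before decoding. All dimensions
# and module choices here are illustrative, not MAviS-Chat's real design.
import torch
import torch.nn as nn

class ToyMultimodalChat(nn.Module):
    def __init__(self, vision_dim=768, audio_dim=512, lm_dim=4096):
        super().__init__()
        # Stand-ins for the projections that follow pretrained encoders.
        self.vision_proj = nn.Linear(vision_dim, lm_dim)
        self.audio_proj = nn.Linear(audio_dim, lm_dim)

    def forward(self, vision_feats, audio_feats, text_embeds):
        # Map each modality into the LLM token-embedding space and
        # concatenate along the sequence axis, as multimodal chat
        # models commonly do before the language model decodes.
        v = self.vision_proj(vision_feats)   # (batch, img_tokens, lm_dim)
        a = self.audio_proj(audio_feats)     # (batch, audio_tokens, lm_dim)
        return torch.cat([v, a, text_embeds], dim=1)

model = ToyMultimodalChat()
fused = model(torch.randn(1, 16, 768),    # image features
              torch.randn(1, 8, 512),     # audio features
              torch.randn(1, 32, 4096))   # text token embeddings
print(fused.shape)  # torch.Size([1, 56, 4096])
```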

The researchers fine-tuned MAviS-Chat on the MAviS-Dataset using both the pretraining and instruction-tuning subsets, and applied a technique called low-rank adaptation (LoRA) to keep fine-tuning efficient.
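
LoRA freezes the base model and trains only small low-rank matrices added to selected layers, which makes domain-specific fine-tuning far cheaper than updating every weight. A minimal sketch of the general recipe with Hugging Face’s peft library is below; the base model, target modules, and hyperparameters are illustrative stand-ins, not the authors’ actual configuration.

```python
# Minimal LoRA sketch with Hugging Face's peft library. Model choice and
# hyperparameters are illustrative, not the authors' configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any causal language model can stand in for the LLM inside MAviS-Chat.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small adapter matrices train
```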

They also developed a benchmark dataset called MAviS-Bench, designed specifically to evaluate multimodal models on the task of bird species recognition. They compared MAviS-Chat to other open- and closed-source models, including GPT-4o, Gemini 1.5, and the base version of MiniCPM-o-2.6, on MAviS-Bench.

They found that MAviS-Chat outperformed the baseline MiniCPM-o-2.6 model by a significant margin on several performance metrics, and in some cases it even beat GPT-4o. On a metric known as ROUGE-1, MAviS-Chat scored 34.17 compared with GPT-4o’s 30.55; on another metric called MoverScore, it scored 54.76 to GPT-4o’s 54.03.
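
For readers unfamiliar with it, ROUGE-1 scores the single-word (unigram) overlap between a generated answer and a reference answer. The sketch below shows how such a score can be computed with the open-source rouge-score package; the example sentences are invented for illustration, not outputs from MAviS-Bench.

```python
# Computing a ROUGE-1 score with the rouge-score package. The reference
# and candidate answers here are made-up examples, not benchmark outputs.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

reference = "The northern cardinal is a bright red songbird with a crest."
candidate = "The northern cardinal is a red bird with a prominent crest."

scores = scorer.score(reference, candidate)
print(scores["rouge1"].fmeasure)  # unigram-overlap F1, the usual ROUGE-1 value
```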

Overall, the researchers found that MAviS-Chat achieved state-of-the-art open-source results and demonstrated the effectiveness of their instruction-tuning approach. It is also much smaller than the proprietary models that were tested, establishing what they call a “promising middle ground” between general-purpose models like GPT-4o, which perform well but carry high inference costs, and smaller open models that are cheap to run but struggle with fine-grained avian understanding.

Taking flight

The team’s findings illustrate the tension between breadth and specificity, showing how a model fine-tuned on a smaller, curated dataset can outperform one trained only on much larger but more general data.

Kryklyvets, however, isn’t just interested in technological insights, but in how AI systems can be applied in the world. “This is a case where AI can make a real impact by helping people working in sustainability and habitat management,” she says.

Cholakkal explains that the researchers will continue this effort by building an app on top of MAviS-Chat, making the model available to more users and letting them submit photos and audio recordings of birds they encounter in the wild.
