New multimodal model brings pixel-level precision to satellite imagery

Tuesday, July 15, 2025

People use multimodal models, which are designed to interpret both images and text, for a wide range of tasks, including image classification, image segmentation, and scene understanding. For the most part, these systems were built to analyze the kinds of images we come across every day: photos taken from the perspective of a person on the ground. But they have the potential to interpret any kind of image, including ones taken from far above.

Remote sensing images — which are captured by satellites, drones, and other aerial sensors — are used in fields like environmental management, urban planning, and disaster response. Systems that can quickly process large amounts of this visual data can be extremely helpful to people working in these disciplines.

But today, even the best general multimodal systems struggle with remote sensing images. Radical changes in perspective and in the scale of objects make it difficult for these systems to interpret such images accurately. And while researchers have built multimodal models specifically designed for remote sensing, those systems don’t perform all the tasks that more general models do.

For the first time, a team of researchers from MBZUAI and other institutions has designed a multimodal model that supports pixel grounding on remote sensing images. Pixel grounding is useful because it associates individual pixels in an image with specific object categories, such as buildings and cars, and with expressions that refer to them, for example, “the largest baseball field.” This gives users the ability to analyze images in extreme detail.
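To make the idea concrete, here is a minimal Python sketch of what a pixel-grounded response could look like, assuming a simple dictionary format; the field names and structure are illustrative, not GeoPixel’s actual output format.

```python
import numpy as np

# Hypothetical structure of a pixel-grounded response: each referring
# phrase in the generated text is paired with a binary mask over the
# image, where True marks pixels belonging to the referenced object.
image_height, image_width = 1024, 1024

grounded_response = {
    "text": "The largest baseball field sits next to a parking lot.",
    "groundings": [
        {
            "phrase": "The largest baseball field",
            "mask": np.zeros((image_height, image_width), dtype=bool),
        },
        {
            "phrase": "a parking lot",
            "mask": np.zeros((image_height, image_width), dtype=bool),
        },
    ],
}
```

A downstream user could then count objects, measure their areas in pixels, or overlay the masks on the original image.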

“With our model GeoPixel, we have achieved a step forward in tying up natural language to objects at the pixel level” in remote sensing images, says Akashah Shabbir, a Ph.D. student in Computer Vision at MBZUAI.

Shabbir is a co-author of a study that describes GeoPixel, a dataset used to train the model, and a benchmark dataset designed to evaluate multimodal models on remote sensing tasks. She will present the team’s findings at the upcoming International Conference on Machine Learning (ICML) in Vancouver. Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, and Salman Khan are co-authors of the study.

A new architecture

GeoPixel combines remote sensing image processing with the capabilities of a large language model, says Salman Khan, associate professor of Computer Vision at MBZUAI and co-author of the study. When a user asks GeoPixel about an image, the system places distinct masks on top of relevant objects in the image and ties these masks to its text output, allowing for detailed analysis of images.

“If you ask another geospatial model if there are three ships in an image, it may tell you that there are three ships, but it won’t tell you where they are precisely” because the existing models don’t support pixel grounding, Khan says.

An image captured by satellite might cover several kilometers, with buildings and roads taking up only small parts of the overall scene, which makes it difficult for models to identify details. Such images are also often very high resolution, which makes them demanding to process in full.

Other multimodal systems designed for analyzing remote sensing images can’t handle large images, but GeoPixel processes images at up to 4K resolution. It does so by breaking an image into small patches and generating a low-resolution version of the overall scene. The patches and the low-resolution global image are passed to a vision encoder, and the resulting image features are projected into the large language model and aligned using a technique called partial low-rank adaptation (pLoRA). When the model generates an output, objects mentioned in the text are tied to pixel-level segmentation masks in the image.
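As a rough sketch of that tiling step, the snippet below splits a high-resolution image into fixed-size local patches and produces a downsampled global view. The patch size and shapes are assumptions for illustration; the paper’s exact tiling scheme, vision encoder, and pLoRA alignment are not reproduced here.

```python
import torch
import torch.nn.functional as F

def tile_high_res_image(image: torch.Tensor, patch_size: int = 448):
    """Split a (C, H, W) image into local patches plus a low-resolution
    global view. A simplified sketch of a high-resolution pipeline;
    sizes are illustrative, not GeoPixel's actual configuration."""
    c, h, w = image.shape
    # Pad so height and width are multiples of the patch size.
    pad_h = (patch_size - h % patch_size) % patch_size
    pad_w = (patch_size - w % patch_size) % patch_size
    image = F.pad(image, (0, pad_w, 0, pad_h))

    # Cut into non-overlapping local patches.
    patches = image.unfold(1, patch_size, patch_size)
    patches = patches.unfold(2, patch_size, patch_size)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch_size, patch_size)

    # Low-resolution version of the whole scene for global context.
    global_view = F.interpolate(
        image.unsqueeze(0),
        size=(patch_size, patch_size),
        mode="bilinear",
        align_corners=False,
    ).squeeze(0)
    return patches, global_view

# Example: a 4K-class RGB image yields a stack of local patches plus one
# global view, both of which would be passed to the vision encoder.
patches, global_view = tile_high_res_image(torch.rand(3, 2160, 3840))
```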

“Both the overall context and the details are important in an image and this approach helps the model understand the details at high resolution,” Shabbir says.

Figure: Proposed architecture of GeoPixel.

A new remote sensing dataset

GeoPixel’s abilities are made possible by the new dataset used for its training. Called GeoPixelD, it is a multimodal grounded conversation generation dataset composed of nearly 54,000 phrases linked to more than 600,000 object masks.

These annotations “provide rich semantic descriptions that integrate both comprehensive, scene-level contextual information and precise, localized object-level details,” the researchers write. Much of the current dataset relates to urban environments that include things like roads, buildings, buses, and football fields.

The team developed a pipeline for filtering and verifying the dataset. “We put a lot of work into generating a dataset that has natural language descriptions tied to segmentation masks,” Shabbir says.
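As an illustration of the kind of automated check such a pipeline might include (an assumption on our part, not the authors’ actual verification code), one simple pass could confirm that every grounded phrase appears in its description and comes with at least one non-empty mask:

```python
import numpy as np

def verify_annotation(description: str, groundings: list[dict]) -> list[str]:
    """Return a list of problems found in one grounded annotation.
    Purely illustrative; not the GeoPixelD pipeline itself."""
    problems = []
    for g in groundings:
        if g["phrase"] not in description:
            problems.append(f"phrase missing from description: {g['phrase']!r}")
        if not any(mask.any() for mask in g["masks"]):
            problems.append(f"all masks empty for phrase: {g['phrase']!r}")
    return problems

# Example with one valid grounding and one broken one.
desc = "Several buses are parked beside the largest building."
issues = verify_annotation(desc, [
    {"phrase": "Several buses", "masks": [np.ones((64, 64), dtype=bool)]},
    {"phrase": "a runway", "masks": [np.zeros((64, 64), dtype=bool)]},
])
print(issues)  # reports the unknown phrase and the empty mask
```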

Results on a new benchmark and next steps

The researchers tested the performance of GeoPixel and other multimodal models on understanding tasks using their new benchmark dataset, which includes more than 5,400 pairs of referring expressions and segmentation masks.

They found that GeoPixel performed better than the other models on tasks known as grounded conversation generation and referring expression segmentation.
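Referring expression segmentation is commonly scored by comparing each predicted mask against its ground-truth mask with intersection over union (IoU). Here is a minimal sketch of that metric in Python; the benchmark’s exact protocol and aggregate statistics may differ.

```python
import numpy as np

def mask_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union between two binary masks of equal shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(intersection / union) if union > 0 else 1.0

# Example: a prediction that covers the target plus some extra pixels.
target = np.zeros((100, 100), dtype=bool)
target[20:60, 20:60] = True   # ground-truth object: a 40x40 region
pred = np.zeros_like(target)
pred[20:70, 20:60] = True     # prediction overshoots by 10 rows
print(f"IoU = {mask_iou(pred, target):.3f}")  # 1600 / 2000 = 0.800
```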

Khan says that while GeoPixel’s performance is impressive, there are areas where its reasoning and understanding capabilities can improve. He hopes that other researchers will contribute to the project, as the data and code for the study are open source.

Shabbir says that future iterations of GeoPixel could be developed to integrate other kinds of remote sensing data, such as infrared images. And one day it may be used to help those working in critical roles such as environmental management and disaster response.
