Images captured by satellites, aircraft, and drones provide valuable information about the Earth’s environment and are used by researchers in many fields, ranging from agriculture to disaster response to climate science. Vision language models (VLMs) hold the potential to help researchers make sense of huge volumes of visual information, but today’s VLMs struggle to process the wide variety of data collected by remote sensing technologies, which produce infrared, radar, and optical images at different resolutions.
Researchers at MBZUAI, IBM Research, and other institutions have developed a new VLM called EarthDial that has been specifically designed to process geospatial data. It’s the first model of its kind that can handle data in a range of modalities and resolutions, while also processing images captured at different points in time to observe environmental changes.
The developers of EarthDial tested the system on more than 40 tasks that included image classification, object detection, change detection, question answering, and image and region captioning. They found that their model performed better than other models on many of these tasks.
The team will present their findings at the Computer Vision and Pattern Recognition Conference (CVPR) currently being held in Nashville, Tennessee.
Researchers have built general VLMs that can complete many tasks, such as image classification, object detection, and question answering. These models, however, typically aren’t trained on geospatial data.
Developers have also built VLMs specifically for geospatial data, but today’s models don’t work well with high-resolution images of varying sizes, and they don’t support multi-spectral or multi-temporal analysis.
“Our goal was to build a unified model that could handle complex geospatial data, bridging the gap between generic VLMs and domain-specific models,” explains Akhtar Munir, a postdoctoral associate at MBZUAI and one of the developers of EarthDial.
Tasks that EarthDial can be used for include visual question answering, scene classification, disaster assessment, tree species classification, methane plume detection, and urban heat island detection.
Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D. Watson, Levente J. Klein, Fahad Shahbaz Khan, and Salman Khan are coauthors of the study.
EarthDial is made up of three components: a visual encoder, a multilayer perceptron projector, and a large language model (LLM). The visual encoder is built on a model called InternVL that was modified for multi-spectral and multi-temporal processing.
Remote sensing images are produced in a wide variety of sizes and resolutions, which makes it difficult for a single system to interpret features consistently across them. To address this, the researchers used what is called a “dynamic resolution input strategy,” which enhances the system’s ability to analyze fine-grained details in images. This approach automatically selects an optimal aspect ratio for each image from a pre-defined set of ratios, divides the image into patches, and creates a lower-resolution thumbnail to help the model understand the overall scene that is depicted.
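To make the tiling idea concrete, here is a minimal sketch of a dynamic-resolution pipeline of this general kind. It is illustrative only: the tile size, the candidate ratio set, and all function names are assumptions, not EarthDial’s actual implementation.

```python
from PIL import Image

# Hypothetical dynamic-resolution input strategy (all values are assumptions).
TILE = 448  # assumed tile edge length, a common vision-transformer input size
CANDIDATE_RATIOS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1), (2, 3), (3, 2)]

def pick_ratio(width: int, height: int) -> tuple[int, int]:
    """Choose the predefined (cols, rows) grid closest to the image's aspect ratio."""
    target = width / height
    return min(CANDIDATE_RATIOS, key=lambda r: abs(target - r[0] / r[1]))

def tile_image(img: Image.Image) -> list[Image.Image]:
    """Resize to the chosen grid, split into fixed-size patches, append a thumbnail."""
    cols, rows = pick_ratio(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    patches = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    # A low-resolution thumbnail gives the model a view of the overall scene.
    patches.append(img.resize((TILE, TILE)))
    return patches
```

The patches preserve fine-grained detail while the thumbnail supplies global context, which is the trade-off this strategy is designed to balance.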
The multilayer perceptron projector translates information about the images into a format that can be interpreted by the LLM. The researchers write that “this fusion strategy allows EarthDial to integrate visual data from various modalities together with textual descriptions, improving its performance on complex” remote sensing tasks. For the LLM, the researchers adapted a pre-trained system called Phi-3-mini.
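A projector of this kind is typically a small neural network that maps visual features into the language model’s embedding space. The sketch below shows one common form in PyTorch; the layer sizes and structure are illustrative assumptions (the output width of 3,072 is chosen here to match Phi-3-mini’s hidden size), not EarthDial’s exact design.

```python
import torch
import torch.nn as nn

# Hypothetical MLP projector: maps visual-encoder features into the LLM's
# embedding space. Dimensions are illustrative, not EarthDial's actual values.
class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vision_dim),
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(visual_tokens)
```

The projected visual tokens can then be concatenated with text embeddings and fed to the language model as a single sequence, which is what lets one LLM reason over images and text together.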
“There was no unified architecture previously,” Munir says. “Other works have separate encoders for separate modalities and that is computationally inefficient.”
To train the system, the researchers developed a large dataset of question-answer instruction pairs drawn from several existing remote sensing datasets, including SkyScript and SatlasPretrain. They call their new dataset EarthDial-Instruct, and it’s the largest dataset of its kind with more than 11 million samples.
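For intuition, a question-answer instruction pair in a dataset like this might look like the record below; the field names and content are hypothetical, not drawn from EarthDial-Instruct.

```python
# Hypothetical instruction-pair record (field names and content are assumptions).
sample = {
    "image": "tile_00421.tif",  # hypothetical file name
    "modality": "rgb",
    "question": "What land-cover classes are visible in this image?",
    "answer": "The scene is dominated by cropland, with a small river in the northeast.",
}
```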
Munir said that constructing EarthDial-Instruct required significant manual effort to verify samples curated from the pre-existing datasets. This manual effort, however, ensured that the data used to train EarthDial was of high quality.
The researchers trained EarthDial in three stages. In the first stage, they trained the system to make associations between remote-sensing images and text descriptions of those images. In the second stage, they improved the LLM’s performance on tasks it hadn’t previously been exposed to, what’s known as zero-shot performance. In the third stage, they trained it on multispectral images, high-resolution optical images, and data produced by a remote sensing technology known as synthetic aperture radar.
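One way to picture this progressive schedule is as a staged configuration, where each stage changes the data mix and which components are updated. The stage names, data labels, and trainable splits below are illustrative assumptions, not the paper’s exact recipe.

```python
# Illustrative staged-training schedule (all names and settings are assumptions).
STAGES = [
    {   # Stage 1: align image features with text; adapt the vision-to-text bridge.
        "name": "vision-language alignment",
        "data": ["rgb_image_caption_pairs"],
        "trainable": ["projector"],
    },
    {   # Stage 2: instruction tuning to improve zero-shot behavior.
        "name": "instruction_tuning",
        "data": ["earthdial_instruct_rgb"],
        "trainable": ["projector", "llm"],
    },
    {   # Stage 3: extend to multispectral, high-resolution optical, and SAR data.
        "name": "modality_fine_tuning",
        "data": ["multispectral", "hires_optical", "sar"],
        "trainable": ["encoder", "projector", "llm"],
    },
]
```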
“We took a multi-stage approach to training the model and tried different learning schedules and progressive training approaches because we had to make it compute efficient and generalizable across many tasks,” Munir said. “If you have really high accuracy on one task, you will probably lose some accuracy on another task and these are the kinds of things we had to carefully figure out.”
The researchers compared EarthDial to two general VLMs (GPT-4o and InternVL2-8B) and a specialized model (GeoChat) on a classification task. EarthDial was more accurate than the other models across six datasets, with EarthDial outperforming the next-best model (GPT-4o) by nearly 20 percentage points on the BigEarthNet dataset.
EarthDial was also better than GeoChat, InternVL2-4B, and InternVL2-8B at detecting objects across four different datasets. And overall, for the multispectral modality, EarthDial achieved an average improvement of 32.5% in classification accuracy compared to GPT-4o.
In the future, the researchers plan to improve EarthDial’s performance and expand its capabilities to include image segmentation. The team has released the code for EarthDial, and Munir encourages people to use it, saying that the system will become more efficient with input from the community.