A camera in a backyard catches a parent playing with their children. The microphone on the same device picks up the voices of the parent and the children. Two streams of evidence, two ways of recognizing an action, and most of the time they tell the same story. The video says people are talking, and the audio says the same thing. The multimodal model trained on both says there are people talking, with a high degree of confidence.
But what happens when that same system is asked to apply its understanding to a very different scenario where the lighting may be different or the audio unclear?
This is the question at the center of new work from MBZUAI, conducted under the supervision of Muhammad Haris Khan, Assistant Professor of Computer Vision, and presented at CVPR 2026. It is a question that the AI community has, until now, treated as three separate problems rather than one to be solved together.
The first problem is domain shift. A model trained on backyard videos from one set of homes tends to do badly on backyard videos from another set. This could be because the lighting is different, the cameras are mounted at different angles, or the backyards themselves are organized differently.
The second problem is data scarcity. Labeling multimodal data, where every example needs to be tagged across both video and audio, is expensive enough that real datasets are largely unlabeled.
The third problem arises in multimodal learning itself: real systems often lose access to one of their input streams at runtime, so a model trained on both video and audio may have to work with only one.
Each of these problems has been studied in isolation, but the authors of the paper point out that nobody has studied them together. And when they take the leading methods from each subfield and run them on the combined problem, the results are uniformly bad. Methods built for domain generalization break when they cannot use unlabeled data. Methods built for semi-supervised learning break when they encounter unseen domains. And methods built for semi-supervised domain generalization, in particular, break because they were designed for a single modality and have no idea what to do when video and audio start contradicting each other.
The authors name the combined problem Semi-Supervised Multimodal Domain Generalization (SSMDG). Their underlying observation is that, in a backyard or anywhere else, the modalities agree most of the time, and that agreement can be used to bootstrap a model out of having very few labels. The trick is figuring out what to do with the cases where they do not.
The team’s proposal is a graduated trust system, which works in four parts. First, when the fused multimodal prediction is confident and at least one of the individual streams is also confident and agrees with it, the system treats this as a reliable pseudo-label. It uses the label to train itself, in the way that semi-supervised systems have done for years.
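The consensus rule can be sketched in a few lines. This is an illustrative reading of the description above, not the authors' code: the function names and the 0.9 confidence threshold are assumptions, and each input is taken to be a list of class probabilities.

```python
from typing import Optional, Sequence


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def consensus_pseudo_label(
    fused: Sequence[float],
    video: Sequence[float],
    audio: Sequence[float],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> Optional[int]:
    """Return a pseudo-label only when the fused prediction is confident
    and at least one single-modality prediction is confident and agrees.
    Return None when that consensus is missing."""
    fused_cls = argmax(fused)
    if fused[fused_cls] < threshold:
        return None
    for probs in (video, audio):
        cls = argmax(probs)
        if cls == fused_cls and probs[cls] >= threshold:
            return fused_cls
    return None
```

An unlabeled clip for which this returns a class index would be trained on as if it were labeled; a clip for which it returns `None` falls outside the consensus set.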
When this consensus breaks down, the second part comes into play, and this is where the novelty of the approach lies. Rather than throw away the disagreeing examples, the authors keep them but treat them with a different loss function, one that is more forgiving of noise. The reasoning is that even an uncertain signal contains some information, and a robust enough learning procedure can extract it without being misled by the inevitable errors.
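The article does not say which noise-tolerant loss the authors use. As a stand-in, the sketch below compares ordinary cross-entropy with generalized cross-entropy, a well-known robust loss, to show what "more forgiving" means in practice. The choice of loss and the value `q = 0.7` are assumptions made for illustration.

```python
import math


def cross_entropy(p_label: float) -> float:
    """Standard loss: grows without bound as the probability assigned
    to the (pseudo-)label approaches zero. Requires p_label > 0."""
    return -math.log(p_label)


def generalized_cross_entropy(p_label: float, q: float = 0.7) -> float:
    """Stand-in robust loss (1 - p^q) / q for 0 < q <= 1.
    It is bounded above by 1 / q, so a single wrong pseudo-label
    cannot dominate the training signal. q is an assumed setting."""
    return (1.0 - p_label ** q) / q
```

If a pseudo-label is wrong and the model assigns it a probability of 0.01, cross-entropy penalizes the model with a loss of about 4.6, while the bounded loss stays below 1/q, roughly 1.43. The noisy example still contributes a gradient, but a capped one.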
The third piece of the system addresses the question of representation. The authors maintain running averages of what each class looks like in each modality and each source domain. These class prototypes serve as anchors. New examples are pulled toward the prototype that matches their predicted class, both within their own domain and across other domains, and the system learns lightweight translators that can convert features from one modality into the other.
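A minimal sketch of the running-average prototypes, assuming an exponential moving average and a squared-distance pull. The key layout, the momentum value, and the function names are hypothetical; the article does not give the exact update rule.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (modality, source domain, class index)
Key = Tuple[str, str, int]
Vector = List[float]


def update_prototype(
    prototypes: Dict[Key, Vector],
    key: Key,
    feature: Vector,
    momentum: float = 0.9,  # assumed value
) -> Vector:
    """Blend a new feature into the running average stored under key.
    The first feature seen for a key initializes its prototype."""
    old = prototypes.get(key)
    if old is None:
        prototypes[key] = list(feature)
    else:
        prototypes[key] = [
            momentum * o + (1.0 - momentum) * f for o, f in zip(old, feature)
        ]
    return prototypes[key]


def pull_loss(feature: Vector, prototype: Vector) -> float:
    """Squared Euclidean distance; minimizing it pulls the feature
    toward the prototype of its predicted class."""
    return sum((f - p) ** 2 for f, p in zip(feature, prototype))
```

In this reading, the same `pull_loss` would be applied twice per example: once against the prototype from the example's own domain and once against prototypes of the same class from other source domains.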
The translators do double duty. During training, they enforce that the representations of video and audio for the same class end up close to each other. At test time, if one modality goes missing, the translator can synthesize the absent features from whatever is present.
Lastly, the system can be trained on full multimodal data and then deployed in a setting where, say, the audio channel is unavailable. Most existing approaches either fail completely in this scenario or fall back to processing only the available modality, which discards everything they learned about how the modalities relate to each other. The translation-based approach instead reconstructs an estimate of what the missing modality would have looked like and uses it. On the standard benchmark with five labels per class and a missing video stream, the translation approach beats simple substitution by roughly six percentage points, which is the difference between a usable system and a frustrating one.
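The missing-modality fallback can be sketched as follows. The translators here are plain linear maps with hand-written weights, purely for illustration; the article only says they are "lightweight", so the architecture, the names, and the numbers are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Translator = Callable[[Vector], Vector]


def make_linear_translator(weights: List[Vector], bias: Vector) -> Translator:
    """Build a toy translator computing W @ x + b.
    In a real system the weights would be learned during training."""
    def translate(feature: Vector) -> Vector:
        return [
            sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)
        ]
    return translate


def fill_missing(
    video: Optional[Vector],
    audio: Optional[Vector],
    audio_to_video: Translator,
    video_to_audio: Translator,
) -> Tuple[Vector, Vector]:
    """Synthesize whichever feature vector is absent from the one
    that is present, so the fused model still sees both inputs."""
    if video is None and audio is None:
        raise ValueError("at least one modality must be present")
    if video is None:
        video = audio_to_video(audio)
    if audio is None:
        audio = video_to_audio(video)
    return video, audio
```

The alternative the article calls simple substitution would skip the translator and feed the model only the available stream or a placeholder for the missing one; the translator instead supplies an estimate shaped by what the model learned about how the two modalities relate.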
On a kitchen action recognition benchmark with five labels per class, the system reaches just under 40% accuracy, compared to roughly 37% for the strongest baseline. On a stylistic generalization benchmark called HAC, which asks whether actions recognized in human videos can be recognized in animal videos and cartoon videos of the same actions, the numbers are higher and the improvements more pronounced.
The paper also explores how modalities should be treated. The dominant approach in multimodal learning has been to fuse everything early, train one large model on the fused representation, and let the network figure out which modality matters when.
The authors argue that modalities are useful precisely because they sometimes disagree, and that a system which knows how to handle disagreement – by trusting consensus more, by tolerating discord rather than ignoring it, and by maintaining a sense of what each modality looks like on its own – is doing something more sophisticated than fusion alone can capture.
For now, the benchmarks involve only video and audio, with a third experiment adding optical flow. Whether the same approach scales to settings with text, depth, thermal imaging, or any of the other modalities that real systems increasingly want to combine is an open question. The improvements over baselines, while consistent, are not large enough to settle the matter on their own.
By navigating the balance between cross-modal consensus and informative disagreement, the framework the authors propose for SSMDG provides a practical pathway for deploying multimodal models that remain resilient even when data is scarce and environments are unpredictable. The authors hope that their problem formulation, benchmarks, and framework will stimulate further research in this area.