Making sense of space and time in video

Thursday, October 12, 2023

In new and emerging fields of innovation, scientists tend to work on seemingly straightforward problems first before moving on to those that are more complex. The field of computer vision is no exception. When computer vision was born decades ago, scientists began by writing programs that made it possible for computers to categorize images based on what is represented in them. Image classification programs were able to determine that one image was a human face, another a car, another a tree.

In recent years, scientists working on computer vision have tackled more complex tasks, such as image segmentation, which groups image regions into meaningful categories, and object segmentation, which delineates individual objects from their background. And today, computer vision principles are applied not only to still images but to video as well.

But even the newest innovations draw on previous insights, and a team of researchers at MBZUAI is currently developing a new approach to analyzing action in videos that builds on earlier work in still image processing. The study, led by Syed Talal Wasim, a research assistant at MBZUAI, was recently presented at the International Conference on Computer Vision, held earlier this month in Paris.

From object to action

It will come as no surprise that it is more complex for a computer to make sense of a video than of a still image. Videos contain more data than images, with correspondingly greater demands on processing power, and the added dimension of time further complicates analysis.

Yet basic tasks like image classification work well for videos. “A more interesting question,” Wasim said, “is to not only classify a video, but to understand what is happening over time in a video and make sense of this information across frames.” Doing so requires a capability known as temporal aggregation.
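In its simplest form, temporal aggregation can be as basic as pooling per-frame features into a single clip-level representation before classifying. The sketch below (with made-up tensor sizes) shows that baseline, which more sophisticated methods such as the one described here aim to improve on.

```python
import torch

# A minimal baseline for temporal aggregation (sizes are made up for illustration):
# average the per-frame feature vectors into one clip-level feature before classifying.
frame_features = torch.randn(8, 512)       # features for 8 sampled frames, 512 dims each
clip_feature = frame_features.mean(dim=0)  # one 512-dim vector summarizing the whole clip
```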

There are established approaches for facilitating temporal aggregation in video, Wasim explained. These include convolutional neural networks and transformers. Both approaches work but have limitations.

Convolutional neural networks struggle to capture the temporal aspect of video. Transformers, which were first applied in the field of natural language processing, handle how time unfolds in video well, but they place heavy demands on computing power because the cost of self-attention grows quadratically with the number of tokens.
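To get a feel for why that matters, here is a rough back-of-the-envelope comparison. The frame and patch counts are hypothetical, chosen only for illustration: self-attention compares every space-time token with every other token, while a small depthwise convolution touches each token only a fixed number of times.

```python
# Hypothetical clip: 8 sampled frames, each split into a 14 x 14 grid of patches.
tokens = 8 * 14 * 14                  # 1,568 space-time tokens in total

attention_pairs = tokens ** 2         # self-attention: every token attends to every token
conv_touches = tokens * 3 * 3         # depthwise 3x3 aggregation: fixed-size neighborhood per token

print(f"self-attention interactions: {attention_pairs:,}")   # 2,458,624
print(f"convolutional aggregation:   {conv_touches:,}")      # 14,112
```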

“While people have been doing temporal modeling for years,” Wasim said, “we wanted to know if we could do it effectively while avoiding the issue of quadratic complexity.”

Not straight to video

To begin, the researchers sample still frames from a video at regular intervals, building a series of images. They then take focal modulation, a technique originally designed for images, and apply it to these frames, using it to analyze spatial and temporal information separately and at different scales. “We do spatial focal modulation on the frames and temporal focal modulation on each pixel,” Wasim said.
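As a rough illustration of this separation (a sketch only, not the authors’ Video-FocalNets implementation, which uses multiple focal levels and hierarchical stages), the example below gathers spatial context within each frame with a depthwise 2D convolution and temporal context at each pixel location with a depthwise 1D convolution over the frame axis, then uses the gated contexts to modulate a query projection. All layer names and sizes here are assumptions made for the example.

```python
import torch
import torch.nn as nn


class SpatioTemporalFocalModulation(nn.Module):
    """Hypothetical sketch of separable spatio-temporal focal modulation.

    Spatial context is gathered within each frame by a depthwise 2D convolution,
    temporal context is gathered at each pixel location by a depthwise 1D
    convolution over the frame axis, and the gated contexts modulate a query
    projection. This illustrates the idea only; it is not the authors' code.
    """

    def __init__(self, dim, spatial_kernel=3, temporal_kernel=3):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        # depthwise convolution over (height, width) inside a single frame
        self.spatial_ctx = nn.Conv2d(dim, dim, spatial_kernel,
                                     padding=spatial_kernel // 2, groups=dim)
        # depthwise convolution over time at a single pixel location
        self.temporal_ctx = nn.Conv1d(dim, dim, temporal_kernel,
                                      padding=temporal_kernel // 2, groups=dim)
        self.gate = nn.Linear(2 * dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, frames, height, width, channels) -- a clip of sampled frames
        B, T, H, W, C = x.shape
        q = self.query(x)

        # spatial focal context: run the 2D depthwise conv on every frame
        xs = x.permute(0, 1, 4, 2, 3).reshape(B * T, C, H, W)
        s_ctx = self.spatial_ctx(xs).reshape(B, T, C, H, W).permute(0, 1, 3, 4, 2)

        # temporal focal context: run the 1D depthwise conv along the frame axis
        xt = x.permute(0, 2, 3, 4, 1).reshape(B * H * W, C, T)
        t_ctx = self.temporal_ctx(xt).reshape(B, H, W, C, T).permute(0, 4, 1, 2, 3)

        # modulate the query with a gate computed from both contexts
        gate = torch.sigmoid(self.gate(torch.cat([s_ctx, t_ctx], dim=-1)))
        return self.proj(q * gate)


# Example: 2 clips of 8 frames, each frame a 14 x 14 grid of 96-dim features
clip = torch.randn(2, 8, 14, 14, 96)
out = SpatioTemporalFocalModulation(dim=96)(clip)   # output has the same shape as the input
```

In this separable form the cost grows roughly linearly with the number of space-time tokens, rather than quadratically as with full self-attention over the clip.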

Their approach, which they named Video-FocalNets, yielded interesting results. Because spatial and temporal characteristics were modeled separately, Video-FocalNets could indicate which aspect was more difficult for a machine to analyze. “There were some videos that were examples of difficult spatial problems, but an easy temporal problem,” Wasim said. “Video-FocalNets was able to figure it out based on the temporal aspect and it provided us insight into the problem itself.”

Wasim said he is interested in thinking through how existing techniques can be used for new applications, but that it is always important to consider how those techniques are being applied. “If you have something that works on images, and you want to make it work on videos, it’s important to consider what is the best way to make it work on videos, because you have another dimension on top,” Wasim said.

Wasim also noted that “the model is inherently visualizable. It shows us what it is focusing on. It doesn’t need an external visualization technique to extract visualizations out of it.”

Co-authors on the paper are Muhammad Uzair Khattak of MBZUAI, Muzammal Naseer of MBZUAI, Salman Khan of MBZUAI and the Australian National University, Mubarak Shah of the University of Central Florida, and Fahad Khan of MBZUAI and Linköping University.
