Making sense of space and time in video

Thursday, October 12, 2023

In new and emerging fields of innovation, scientists tend to work on seemingly straightforward problems first before moving on to those that are more complex. The field of computer vision is no exception. When computer vision was born decades ago, scientists began by writing programs that made it possible for computers to categorize images based on what is represented in them. Image classification programs were able to determine that one image was a human face, another a car, another a tree.

In recent years, scientists working on computer vision have tackled more complex tasks, such as image segmentation, which groups image regions into meaningful categories, and object segmentation, which delineates individual objects from their background. And today, computer vision principles are applied not only to still images but to video as well.

But even the newest innovations draw on previous insights, and a team of researchers at MBZUAI is currently developing a new approach to analyzing action in videos that builds on earlier work in still image processing. The study, led by Syed Talal Wasim, a research assistant at MBZUAI, was recently presented at the International Conference on Computer Vision, held earlier this month in Paris.

From object to action

It will come as no surprise that it is more complex for a computer to make sense of a video than of a still image. Videos contain more data than images, with correspondingly greater demands on processing power, and the added dimension of time further complicates analysis.

Yet basic tasks like image classification work well for videos. “A more interesting question,” Wasim said, “is to not only classify a video, but to understand what is happening over time in a video and make sense of this information across frames.” Doing so requires a capability known as temporal aggregation.
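In its simplest form, temporal aggregation can be as basic as pooling per-frame features into a single clip-level representation before classifying. The sketch below (with made-up tensor sizes) shows that baseline, which more sophisticated methods such as the one described here aim to improve on.

```python
import torch

# A minimal baseline for temporal aggregation (sizes are made up for illustration):
# average the per-frame feature vectors into one clip-level feature before classifying.
frame_features = torch.randn(8, 512)       # features for 8 sampled frames, 512 dims each
clip_feature = frame_features.mean(dim=0)  # one 512-dim vector summarizing the whole clip
```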

There are established approaches for facilitating temporal aggregation in video, Wasim explained. These include convolutional neural networks and transformers. Both approaches work but have limitations.

Convolutional neural networks struggle to capture the temporal aspect of video. Transformers, which were first applied in the field of natural language processing, handle how time unfolds in video well, but they place heavy demands on computing power because the cost of self-attention grows quadratically with the number of tokens.
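To get a feel for why that matters, here is a rough back-of-the-envelope comparison. The frame and patch counts are hypothetical, chosen only for illustration: self-attention compares every space-time token with every other token, while a small depthwise convolution touches each token only a fixed number of times.

```python
# Hypothetical clip: 8 sampled frames, each split into a 14 x 14 grid of patches.
tokens = 8 * 14 * 14                  # 1,568 space-time tokens in total

attention_pairs = tokens ** 2         # self-attention: every token attends to every token
conv_touches = tokens * 3 * 3         # depthwise 3x3 aggregation: fixed-size neighborhood per token

print(f"self-attention interactions: {attention_pairs:,}")   # 2,458,624
print(f"convolutional aggregation:   {conv_touches:,}")      # 14,112
```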

“While people have been doing temporal modeling for years,” Wasim said, “we wanted to know if we could do it effectively while avoiding the issue of quadratic complexity.”

Not straight to video

To begin, the researchers sample still frames from a video at regular intervals, building a series of images. They then take focal modulation, a technique originally designed for images, and apply it to these frames, using it to analyze spatial and temporal information separately and at different scales. “We do spatial focal modulation on the frames and temporal focal modulation on each pixel,” Wasim said.
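As a rough illustration of this separation (a sketch only, not the authors’ Video-FocalNets implementation, which uses multiple focal levels and hierarchical stages), the example below gathers spatial context within each frame with a depthwise 2D convolution and temporal context at each pixel location with a depthwise 1D convolution over the frame axis, then uses the gated contexts to modulate a query projection. All layer names and sizes here are assumptions made for the example.

```python
import torch
import torch.nn as nn


class SpatioTemporalFocalModulation(nn.Module):
    """Hypothetical sketch of separable spatio-temporal focal modulation.

    Spatial context is gathered within each frame by a depthwise 2D convolution,
    temporal context is gathered at each pixel location by a depthwise 1D
    convolution over the frame axis, and the gated contexts modulate a query
    projection. This illustrates the idea only; it is not the authors' code.
    """

    def __init__(self, dim, spatial_kernel=3, temporal_kernel=3):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        # depthwise convolution over (height, width) inside a single frame
        self.spatial_ctx = nn.Conv2d(dim, dim, spatial_kernel,
                                     padding=spatial_kernel // 2, groups=dim)
        # depthwise convolution over time at a single pixel location
        self.temporal_ctx = nn.Conv1d(dim, dim, temporal_kernel,
                                      padding=temporal_kernel // 2, groups=dim)
        self.gate = nn.Linear(2 * dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, frames, height, width, channels) -- a clip of sampled frames
        B, T, H, W, C = x.shape
        q = self.query(x)

        # spatial focal context: run the 2D depthwise conv on every frame
        xs = x.permute(0, 1, 4, 2, 3).reshape(B * T, C, H, W)
        s_ctx = self.spatial_ctx(xs).reshape(B, T, C, H, W).permute(0, 1, 3, 4, 2)

        # temporal focal context: run the 1D depthwise conv along the frame axis
        xt = x.permute(0, 2, 3, 4, 1).reshape(B * H * W, C, T)
        t_ctx = self.temporal_ctx(xt).reshape(B, H, W, C, T).permute(0, 4, 1, 2, 3)

        # modulate the query with a gate computed from both contexts
        gate = torch.sigmoid(self.gate(torch.cat([s_ctx, t_ctx], dim=-1)))
        return self.proj(q * gate)


# Example: 2 clips of 8 frames, each frame a 14 x 14 grid of 96-dim features
clip = torch.randn(2, 8, 14, 14, 96)
out = SpatioTemporalFocalModulation(dim=96)(clip)   # output has the same shape as the input
```

In this separable form the cost grows roughly linearly with the number of space-time tokens, rather than quadratically as with full self-attention over the clip.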

Their approach, which they named Video-FocalNets, yielded interesting results. Because spatial and temporal characteristics were modeled separately, Video-FocalNets could indicate which aspect was more difficult for a machine to analyze. “There were some videos that were examples of difficult spatial problems, but an easy temporal problem,” Wasim said. “Video-FocalNets was able to figure it out based on the temporal aspect and it provided us insight into the problem itself.”

Wasim said he is interested in thinking through how existing techniques can be used for new applications, but that it is always important to consider how those techniques are being applied. “If you have something that works on images, and you want to make it work on videos, it’s important to consider what is the best way to make it work on videos, because you have another dimension on top,” Wasim said.

Wasim also noted that “the model is inherently visualizable. It shows us what it is focusing on. It doesn’t need an external visualization technique to extract visualizations out of it.”

Co-authors on the paper are Muhammad Uzair Khattak of MBZUAI, Muzammal Naseer of MBZUAI, Salman Khan of MBZUAI and the Australian National University, Mubarak Shah of the University of Central Florida, and Fahad Khan of MBZUAI and Linköping University.
