From Abu Dhabi to Silicon Valley: MBZUAI students advance computer vision at Meta

Tuesday, August 26, 2025

The brief was straightforward yet ambitious: build a vision encoder for both images and videos that works out-of-the-box and is the best in the world. That was the project handed to Muhammad Maaz and Hanoona Rasheed, both Ph.D. students in Computer Vision at MBZUAI, on their first day of what turned out to be a transformative internship at Meta’s headquarters in Silicon Valley.

During their time at Meta, Maaz and Hanoona worked side by side with employees and other interns, gaining insights into how AI innovations are developed and scaled at one of the world’s largest technology companies.

“I want to build things that are valuable to people, and the internship broadened my perspective about how AI systems can address their needs,” Maaz says.

For Hanoona, the project was a journey of solving challenges as they arose. At the outset, she and her colleagues didn’t know the exact route they would take to reach their objective, but “we went where the research took us,” she says.

Inside Meta

It is extremely difficult to land an internship at a top tech company like Meta. And it can be even harder for people based outside of the U.S. But two papers Maaz and Hanoona coauthored at MBZUAI — GLaMM and Video-ChatGPT — got the attention of Christoph Feichtenhofer, Research Scientist on Meta’s FAIR team, who reached out to them about the internship.

“Our advisor, Dr. Salman Khan, always reminded us that if we focused on producing good work, the opportunities would follow,” Hanoona says.

The researchers’ day-to-day activities at Meta included meetings with colleagues to discuss technical challenges and devising creative approaches to solve them. Even though they were part of a large team, “every individual had freedom to explore their own paths,” Hanoona says.

One challenge the researchers found was that, while their task was to build a vision encoder that worked for both images and videos, there is much less labeled training data for videos than there is for images.

To address this, the team built a multimodal language model called PerceptionLM that was designed to understand spatial and temporal aspects of video. They used this model to generate a synthetic video-caption dataset that was used to train the encoder, called Perception Encoder. The results of their efforts were published by Meta on the company’s blog.

Hanoona focused on building and scaling high-quality datasets for training the two systems. This included curating human-annotated, real-world video datasets with natural-language descriptions that focused on spatiotemporal understanding, spatial and temporal localization, and training PerceptionLM to learn from this data so that it could perform video understanding tasks.

She also created a large-scale synthetic dataset and analyzed how its scale and quality affected training. This involved improving the underlying data engine responsible for generating synthetic video-text data, making it more effective and scalable for training the Perception Encoder.

Maaz was responsible for training and scaling the performance of multimodal language models. He trained PerceptionLM to increase its performance on real-world video tasks such as video question answering, action recognition, and spatiotemporal understanding.

He also trained a multimodal language model that was used for synthetic data generation and optimized it for generating captions for unlabeled videos. This model was used in a loop to support large-scale, synthetic data generation across millions of video samples for training the Perception Encoder.

Where they’ve been and where they’re going

Before joining MBZUAI, Hanoona studied biomedical engineering and worked in the agriculture and healthcare industries. She was a member of the first cohort to receive a master’s degree from MBZUAI before beginning her doctoral program. She was drawn to the field of computer vision because, she says, “it allows you to directly observe the problem, explore the methodology, and visualize the solution.”

Maaz was part of the same master’s degree cohort and worked in industry in his home country of Pakistan before moving to Abu Dhabi. “MBZUAI allows you to grow very quickly and has given me not only a strong understanding of the field of computer vision but also what it takes to produce impactful research,” he says.

During the internship, they forged connections with their colleagues and deepened their presentation and communication skills. And by the end, Meta had extended offers to both of them to return to the company to continue their work in computer vision.

In the meantime, Maaz says that his experience at Meta made him aware of gaps in scientific research that he hopes to fill as he continues his doctoral studies.

Hanoona is grateful to her mentors, Deputy Department Chair and Professor of Computer Vision Fahad Khan and Associate Professor of Computer Vision Salman Khan, for preparing her to make meaningful contributions to advanced research at a major tech company. She says their mentoring and constant support are what drove her and Maaz to take on challenging problems and compete with leading teams from around the world.

After her doctoral program, she is considering a career in industry. If she takes that direction, she will do so with a solid foundation in how research is advanced and evolves at one of the world’s leading AI labs.
