A unified theory of all things visual

Monday, April 03, 2023

Life arose around 3.7 billion years ago. Photosensitive cells appeared sometime later, kicking off what is believed to be a slow progression from eyespot to eye cup and, eventually, to the compound eye. The first documented “eyes,” and with them some semblance of what we would call “sight,” come from a trilobite, Olenellus fowleri, which lived during the Cambrian explosion.

Sight, as we all know, is not the only sense, but it is the dominant sensory modality in the human world. Given that legacy, it is hard to overstate how important sight is to the living world of things, and consequently, to the integration of machine systems that must function properly in that world.

Take a typical city street, for example. Cars move at different speeds, in different directions, and for many different purposes. People and animals filter through in highly asynchronous ways. Obstacles of all sorts dot the landscape, some of them living, some of them of high value. In parts of the developing world, this frenzy is taken to the extreme as cattle and schoolchildren roam among a range of small transport vehicles, lorries, and buses.

Around the world, traffic systems of signs, signals, lights, and lines exist, but they are followed to varying degrees depending on local culture, time of year, political environment, and myriad additional variables that are hard to forecast with accuracy. In a word, the visual world is “chaos.”

For the makers of robots and autonomous vehicles, scenes like these could mean collisions, system paralysis, or even catastrophe. For researchers attempting to develop algorithms that make sense of this chaos, the situation is more challenging still. For Professor Fahad Khan, it’s been a career.

Khan is Deputy Department Chair of Computer Vision and has recently been promoted to Full Professor of Computer Vision at MBZUAI. He aims to understand all of this chaos at a level that few have the patience or tenacity to muster. His grand vision is to build a detailed understanding of the entire visual world, enabling scientists, engineers, roboticists, and others to solve all manner of visual perception tasks with numerous potential real-world applications.

Khan wants to enable the kind of AI we’ve been hearing about for a generation: smart cities, personalized healthcare systems, and even the fully autonomous vehicles we all aspire to travel in, even in dense population centers like Nairobi, Lahore, or Shanghai.

Khan’s pursuit, since his days in graduate school, has been a grand theory of machine visual intelligence: systems that can do everything humans can in the visual spectrum, as well as many things humans can’t, in terms of speed, wavelength, field of vision, and more.

“Machine perception, in particular the ability to understand the visual world based on input from sensors, is one of the central problems of Artificial Intelligence,” Khan said. “Our research is centered on learning visual recognition models with little to no manual supervision, in terms of both scene semantics and 3D geometry. And then moving one step further to understand these high-level detailed visual relationships among different objects within a scene. Once we understand these visual relationships, then systems will start to make sense of the real world, not simply see it.”

According to Khan, the industry has seen tremendous progress in robotics and autonomous driving, for example. His team’s recent results in developing novel, state-of-the-art, deep learning-based visual recognition models, and in understanding their robustness and generalizability, have moved the field markedly closer to full machine visual intelligence.

Khan breaks the space down into three overlapping areas: the first is understanding sophisticated deep learning-based visual recognition models in terms of their robustness and generalizability; the second is learning visual recognition models with little or no manual human supervision; and the third is moving beyond instance-level recognition to understand more detailed semantics of visual content.
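
To make the second of these areas concrete, the sketch below shows one common technique for learning visual representations with little or no manual supervision: a contrastive, SimCLR-style loss that pulls together the embeddings of two augmented views of the same image and pushes apart everything else in the batch. This is an illustrative example only, not a description of Khan’s own method; the function name, tensor shapes, and temperature value are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    # z1, z2: embeddings of two augmented views of the same batch of images,
    # shape (N, D). Row i of z1 and row i of z2 form a positive pair; every
    # other row in the combined batch is treated as a negative.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D)
    sim = z @ z.t() / temperature                  # cosine-similarity logits
    n = z1.size(0)
    # A sample should never be compared with itself.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # For row i in [0, N) the positive sits at index i + N, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Hypothetical usage with random "embeddings"; in practice z1 and z2 would
# come from an encoder applied to two random augmentations of each image.
if __name__ == "__main__":
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    print(nt_xent_loss(z1, z2).item())
```

Because the training signal comes entirely from the images themselves, no manual labels are needed; the learned encoder can then be fine-tuned or probed for downstream recognition tasks.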

Over the years, Khan has worked extensively on different visual understanding problems, particularly video understanding. This work has led to numerous recognitions and awards, including a best paper award at IEEE ICPR 2016 and, more recently, selection as a best paper finalist at CVPR 2022, as well as top rankings in various international Visual Object Tracking challenges.

Khan has published over 150 peer-reviewed conference papers, journal articles, and book contributions, with several of his papers receiving more than 1,000 citations. He regularly serves as an Area Chair at top AI conferences such as CVPR, NeurIPS, and ICCV, and in comparable roles for journals such as TPAMI, CVIU, and TNNLS. He is serving as a program co-chair for DICTA 2023.

His research has brought his students recognition as well, including a best Nordic Ph.D. thesis award for one of his doctoral students and Swedish Computer Society best M.S. thesis awards. Khan also supports his postdocs and students in publishing at top conferences and journals, including first papers for MBZUAI master’s students at NeurIPS 2022, ECCV 2022, and CVPR 2023.
