Self-confessed research addicts Hanoona Bangalath (India) and Muhammad Maaz (Pakistan) have both realized their dream of publishing at top conferences during their master’s degrees in computer vision at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). Among the most promising students in the Class of 2022, each with a perfect 4.0/4.0 cumulative grade point average (CGPA), the pair will remain at the university to complete their Ph.D. Over the past two years they have spent most waking hours researching, testing, and experimenting, yet they say “quality over quantity” remains their focus, along with staying current with the most pressing computer vision problems and topics. Both credit MBZUAI’s high computing power, living support, and high-profile faculty as the main reasons for wanting to stay in Abu Dhabi at the world’s first AI graduate institution.
The duo come from technical industrial backgrounds, yet still managed to have their first co-authored paper, “Class-agnostic Object Detection with Multi-modal Transformer,” accepted at the European Conference on Computer Vision (ECCV) 2022 — the first publication from MBZUAI students accepted at a top-tier computer vision conference. Also in 2022, their second paper, “Bridging the Gap between Object and Image-level Representation for Open-Vocabulary Detection,” was accepted at the Conference on Neural Information Processing Systems (NeurIPS) 2022, the first student paper from MBZUAI accepted at this renowned conference.
Hanoona’s supervisor, MBZUAI Associate Professor of Computer Vision Salman Khan, a co-author on the two above-mentioned papers, said: “Maaz and Hanoona are among the very best students we could attract in the first admitted batch at MBZUAI. I admire their curiosity and passion, which have led to the hard work they have demonstrated in addressing challenging research questions. I am happy about the fine successes they have had in the past two years and hope that these rising stars will be flag-bearers of the name and reputation of this young institution.”
With no background in research, they thank their respective supervisors for not only believing in them but also teaching them how to conduct research and write papers at a practical level. “Both of us come from a technical industrial background, and their knowledge about the research world is huge,” Hanoona explains. “Within a short span of time, they’ve been able to teach us and build us in such a way that we understand the research culture. They taught us not to be too dependent on them, and to make decisions for ourselves. I fell in love with research and want to be part of a community that will change the face of the future.”
It is this independence that research brings which Maaz loves the most. “The reason I fell in love with research is because it is exciting to be part of a community which constantly creates new ideas to improve the quality of life,” Maaz said. “In research, unlike a nine to five job, we mostly work at our own pace. It provides opportunity for independence in thoughts and action, as researchers can explore new ideas and methods, without many constraints. This can lead to the development of new technologies that can address some of the most pressing issues we face now.”
Neither ever dreamed of publishing at top conferences, and thank MBZUAI for making it happen. “I never imagined I’d do it so quickly and so efficiently,” Maaz said. “It makes you hungry for more. Staying with the same supervisor – after my Ph.D. – I will have a very good research profile.”
Maaz’s supervisor and MBZUAI Deputy Department Chair and Associate Professor of Computer Vision, Fahad Khan, has only praise for the students’ early successes. “They belong to the big leagues of research now,” he said. “They are part of the research community.”
Hanoona added, “I remember Dr. Fahad telling me, ‘You wouldn’t really understand the importance of publishing in the top-tier conferences, but when you get it and when you’re inside the research circle, you will then understand how it feels.’” Maaz added, “The support and engagement from the research community can be incredibly inspiring and drives you to produce high-quality work. It also brings together the small research groups and the giants like Google and Microsoft.”
Maaz is passionate about being part of the open-source community and researchers working collaboratively to solve real-world problems. “I want to be an industrial researcher,” he said. “I’m trying to familiarize myself with the new technologies, learn to solve problems with new ideas and understand the process of research. I believe research that does not benefit society is not sustainable in the long term.”
Both are working on similar topics for their Ph.D. research, helping to solve one of the fundamental problems in computer vision: generic object detection. The main theme of their research is generalization — using multimodal understanding from vision and text to improve machines’ common-sense reasoning and to enable deployment on mobile devices. When they began their master’s research, language-image pre-training was quite new, with fewer than 10 papers on the subject worldwide.
“Combining language and text or multimodal data in training models allows for greater generalization capabilities,” Hanoona explains. “For instance, a model trained on real-world images would also have the ability to understand satellite images, cartoons and other types of images. Furthermore, models trained with a multimodal approach tend to have a broader vocabulary, being able to recognize an object even if it hasn’t been encountered before. For example, by using its understanding of language, a model that has not encountered a zebra during training could still recognize it. However, applying research to real-world scenarios remains a challenging task, and my focus is on addressing this challenge and developing models that generalize effectively for use in various applications such as self-driving cars, security systems, traffic management, and professional sports.”
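The zero-shot behavior Hanoona describes (naming a zebra never seen during image training) can be illustrated with a toy shared image-text embedding space. This is a minimal sketch under stated assumptions: the hand-picked vectors below stand in for embeddings that a real system such as CLIP would learn from large image-text corpora, and are not the students’ actual models.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarity."""
    return v / np.linalg.norm(v)

# Hypothetical text embeddings for class names. We pretend "zebra" never
# appeared as an *image* during training; its text embedding still sits
# near "horse" plus a stripe-like component, purely for illustration.
text_embeddings = {
    "horse": normalize(np.array([1.0, 0.1, 0.0])),
    "zebra": normalize(np.array([0.9, 0.1, 0.4])),
    "car":   normalize(np.array([0.0, 1.0, 0.0])),
}

def classify(image_embedding):
    """Pick the class whose text embedding is most similar to the image."""
    image_embedding = normalize(image_embedding)
    scores = {name: float(image_embedding @ emb)
              for name, emb in text_embeddings.items()}
    return max(scores, key=scores.get)

# An embedding of a striped, horse-like animal lands nearest "zebra",
# even though no zebra image was in the (pretend) training set.
print(classify(np.array([0.85, 0.1, 0.45])))  # → zebra
```

Because classification is a nearest-text-embedding lookup, the vocabulary can grow at test time simply by embedding new class names, which is the generalization property described above.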
There is hardly an application where object detection might not be helpful. It is used widely and can power traffic monitoring, safety in self-driving cars, security, sports analysis, face recognition, and more. However, today’s state-of-the-art industry-specific models, each trained with a separate object detector and a single end task in mind, need improving: they can only detect what they know, not what they don’t. The crux of the problem is giving models more ways to generalize, and Hanoona and Maaz believe that using language supervision alongside images will make models both more accurate and cheaper to deploy commercially.
“The performance of traditional models in object detection is limited by their exposure to a specific set of data or objects,” Hanoona explains. “Traditionally, models have been trained solely on image data. In our research, we augment this by incorporating text during the pre-training stage, using a large corpus of text and images. The rationale behind this approach is that it mimics the natural way humans learn, by absorbing information through reading, listening, and observing, and then connecting the dots. We may not be aware of every object, but by cross-referencing text, images, and prior knowledge, we can identify them.”
Maaz summarizes, “Our message is twofold. Firstly, don’t limit open object detection to just images, incorporate text as well. Secondly, using a large amount of text and visual data to pre-train the model will result in better class-agnostic object detection, which is vital for various downstream tasks and applications. While classification is a relatively easy task, a multimodal model that excels at class-agnostic detection and is generalized to all categories and domains will produce superior outcomes.”
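The two-step recipe Maaz outlines, first detecting objects class-agnostically and then naming the boxes with text, can be sketched in a few lines. The proposals, objectness scores, embeddings, and vocabulary below are hypothetical placeholders, not outputs of the students’ published models:

```python
import numpy as np

def class_agnostic_detect(proposals, objectness_threshold=0.5):
    """Keep every proposal that looks like *an* object, regardless of class."""
    return [p for p in proposals if p["objectness"] >= objectness_threshold]

def label_with_vocabulary(box_embedding, vocabulary):
    """Name a box by its nearest text embedding; the vocabulary can grow at test time."""
    box_embedding = box_embedding / np.linalg.norm(box_embedding)
    return max(vocabulary, key=lambda name: float(box_embedding @ vocabulary[name]))

# Hypothetical open vocabulary of unit-length text embeddings.
vocabulary = {
    "person":  np.array([1.0, 0.0]),
    "bicycle": np.array([0.0, 1.0]),
}

# Hypothetical region proposals: an objectness score plus a region embedding.
proposals = [
    {"objectness": 0.9, "embedding": np.array([0.95, 0.05])},
    {"objectness": 0.8, "embedding": np.array([0.1, 0.9])},
    {"objectness": 0.2, "embedding": np.array([0.5, 0.5])},  # background clutter, filtered out
]

detections = class_agnostic_detect(proposals)
labels = [label_with_vocabulary(p["embedding"], vocabulary) for p in detections]
print(labels)  # → ['person', 'bicycle']
```

The design point mirrors the quote: detection (finding objects) and naming (matching text) are decoupled, so adding a new category only means adding a text embedding, with no detector retraining.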
Maaz and Hanoona are among 12 students returning to complete their Ph.D. from the 52 master’s graduates from the inaugural Class of 2022, who will receive their conferral of degrees on January 30, 2023 at the ADNOC Business Center, Abu Dhabi.