Text-to-image (T2I) generators caused a buzz when they rose to prominence with the launch of tools such as DALL-E, Midjourney and Adobe Firefly. However, users quickly found that anything more than a simple prompt of a few words would confuse the system and lead to images that either looked strange or failed to fulfil the request.
A solution has continued to elude the industry, and the problem became the research focus of Mohammad Hanan Ghani, who recently completed a master’s degree in computer vision at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
Ghani studied electronics and communications engineering with computer science for his bachelor’s degree in his native India, and worked at Samsung as a machine learning engineer before deciding to return to education in 2022. “It was a great experience working for such a large company, but I was mainly working on existing projects, and I was keen to work on something completely new,” he said. “I decided to take the plunge and pursue my dream of researching and publishing papers at top conferences.”
Ghani wasted no time tapping into MBZUAI’s wide reserves of expertise and immersed himself in AI challenges, including enabling algorithms to learn from limited data and getting multimodal models to achieve better results by learning from images and text.
“This area of research is important because machines and many types of portable devices should be able to understand images and language,” Ghani said. “In addition to improving the capabilities of robots and autonomous devices to perform useful tasks, it also improves the efficiency with which they can learn.”
Ghani wanted to explore this field because there remains so much work to be done, and it aligned well with his research interests. “I looked for gaps in the research and areas where the models are not performing especially well. I observed that models don’t work well on long text prompts, so I combined the two methods – large language models and diffusion models. I felt inspired to work on some of the challenges that affect the services tech giants provide.”
Ghani, who was advised by Dr. Salman Khan, associate professor of computer vision at MBZUAI, has published three papers at conferences including the International Conference on Learning Representations (ICLR), the British Machine Vision Conference (BMVC), and the Conference on Neural Information Processing Systems (NeurIPS).
“My ICLR paper was on text to image creation from longer paragraphs of text,” he explained. “My machine learning system aims to produce images that accurately fit the text. We improved upon the existing main technique and were able to generate images that exactly follow the details of the text. To the best of our knowledge, we are the first in the machine learning community to have done so.”
The paper has already been cited more than 20 times, underscoring the potential impact of the research.
Ghani is optimistic that his research on efficient learning has the potential to bring the benefits of machine learning to regions that lack robust energy and data networks. “My methods are mostly label efficient and work in low resource environments where there is a lack of data. It could have multiple uses such as bringing AI to scanning devices in hospitals and helping autonomous vehicles to recognize objects and signs.”
Following his graduation, Ghani would like to stay at MBZUAI and continue his research for a Ph.D. He attributes his positive experience at the university to the quality of teaching and mentoring. “My supervisor, Dr. Salman Khan, gave me a lot of freedom to research the areas I found most compelling and was always very supportive and ready to share insights and advice,” he said. “My mentor, Dr. Muzammal Naseer, a research scientist at MBZUAI, was also generous with his knowledge and experience, and helped guide my research.”
To unwind from the rigors of study and research, Ghani participates in sports and likes going to the gym. “Team sports like volleyball are great to unwind and create opportunities to make new friends,” he said. “We’re fortunate to have great sports and recreation facilities here at MBZUAI. It all helps in creating a well-rounded experience.”