If you have followed the news recently, you may have read about AI’s ability to generate a wide variety of images of humans who have never existed. For example, OpenAI’s DALL-E is a free and widely used deep learning model that generates digital images. Type in a simple prompt, such as “a man in a suit walking to work carrying a coffee,” and the results may be surprisingly realistic at first glance.
Upon a closer look, however, there is often something amiss with the physiological details of these virtual humans.
While human bodies are often rendered accurately by human generation models, these systems have greater difficulty with other features, such as hands, which often come out with six, seven, or even more fingers.
It may seem odd that a basic characteristic of human anatomy is difficult for these powerful machines to recreate, but the results are representative of a “fundamental problem with computer vision and deep learning,” said Xiaodan Liang, visiting associate professor of computer vision at MBZUAI.
At MBZUAI, Liang develops computer vision applications for tasks such as human recognition, human parsing, and digital human generation. She and her colleagues frequently look outside their discipline to improve their models.
“Our research is often inspired by neuroscience because from neuroscience you can get feedback about common sense. A relevant example of human common sense is that a human hand typically has five fingers. This information can be integrated into a deep learning model and used as feedback to supervise the model,” she said.
Deep learning technologies are often described as a black box because humans don’t know exactly how these machines arrive at their results. According to Liang, most deep learning technologies are not interpretable by humans.
It’s not known for sure why deep learning human generation models create hands with more fingers than they should have. But adding human common sense to deep learning could improve results.
“Our research is trying to incorporate common sense into deep learning and our theory is that we can empower deep learning by integrating common sense knowledge into the algorithm to make deep learning more reasonable,” Liang said.
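To make the idea concrete, here is a minimal sketch, assuming a hypothetical auxiliary head that estimates a finger count for each generated hand, of how such a common-sense prior could enter training as an extra loss term. This illustrates the general principle only, not Liang’s actual method; the finger_count_penalty function and the stand-in values are invented for the example.

```python
import torch

def finger_count_penalty(predicted_counts: torch.Tensor,
                         expected: float = 5.0) -> torch.Tensor:
    # Squared deviation from the common-sense prior that a hand
    # typically has five fingers.
    return ((predicted_counts - expected) ** 2).mean()

# Stand-in for per-hand finger-count estimates produced by a hypothetical
# auxiliary head on the generator; requires_grad lets feedback flow back.
counts = torch.tensor([5.2, 6.1, 4.9], requires_grad=True)

main_loss = torch.tensor(0.0)  # placeholder for the generator's usual objective
total_loss = main_loss + 0.1 * finger_count_penalty(counts)
total_loss.backward()  # the common-sense term now supervises the model
print(counts.grad)     # gradients push the estimates toward five fingers
```

Deviations from common sense raise the loss, and the resulting gradients steer the model back toward anatomical plausibility.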
When embarking on a project, Liang and her colleagues begin with a real-world problem and then figure out an application to solve it.
“Working in the field of computer vision, we are always working to address real-world applications. We begin with a problem and then design an algorithm that can address it,” she said.
In addition to human generation, Liang has recently focused her efforts on human parsing, a task in which a machine analyzes an image, determines which regions depict a person, and labels the person’s different parts, such as the head, arms, and torso.
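Conceptually, a parsing model’s output is a per-pixel label map over part categories. The short sketch below, with a made-up label set and a tiny hand-written “prediction” standing in for real network output, shows the kind of data structure involved and how downstream code might use it:

```python
import numpy as np

# Hypothetical label set for a human-parsing model: every pixel of an
# image is assigned one of these part categories.
PART_LABELS = {0: "background", 1: "head", 2: "torso", 3: "left_arm",
               4: "right_arm", 5: "left_leg", 6: "right_leg"}

# Stand-in for a parsing network's prediction: a tiny 4x4 label map
# instead of a full-resolution one.
label_map = np.array([[0, 1, 1, 0],
                      [0, 2, 2, 0],
                      [3, 2, 2, 4],
                      [0, 5, 6, 0]])

# Downstream code can extract a binary mask for any part, e.g. the torso,
# which tells a virtual try-on system where a garment belongs.
torso_mask = label_map == 2
print(int(torso_mask.sum()), "torso pixels")
```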
One application for this technology is in the garment industry, where it could allow people to virtually try on clothes before buying them. In the future, you may be able to visit the website of your favorite clothing label, upload an image of yourself, and try on clothes virtually to see how they look on you.
While the concept sounds simple, accurately recreating how a piece of clothing will look on a human body involves a huge amount of complexity.
“Fashion styles are very different,” Liang noted, “and different types of garments, such as loose-fitting and flowing garments, can be difficult to model. There is also a lot of variability when it comes to human body shapes, since people can be tall or short, overweight or thin.”
All these factors and more, according to Liang, must come into play when developing a system to simulate what a piece of clothing might look like on a person.
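To give a rough sense of how parsing feeds into try-on, here is a deliberately naive sketch that composites a garment over the torso region of a parsing map, reusing the hypothetical torso label from the sketch above. Real systems add 3D-aware garment warping and appearance modeling that this placeholder skips entirely:

```python
import numpy as np

TORSO = 2  # hypothetical torso label from the parsing sketch above

def naive_try_on(person_img: np.ndarray,
                 part_map: np.ndarray,
                 garment_img: np.ndarray) -> np.ndarray:
    # Paint garment pixels directly over the torso region; a real system
    # would first warp the garment to the body's shape and pose.
    out = person_img.copy()
    mask = part_map == TORSO
    out[mask] = garment_img[mask]
    return out

# Tiny stand-in inputs: 4x4 RGB "images" and the label map from above.
person = np.zeros((4, 4, 3), dtype=np.uint8)
garment = np.full((4, 4, 3), 200, dtype=np.uint8)
parts = np.array([[0, 1, 1, 0],
                  [0, 2, 2, 0],
                  [3, 2, 2, 4],
                  [0, 5, 6, 0]])
result = naive_try_on(person, parts, garment)
```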
Liang shared her research on virtual try-on last year at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). Her research describes a system that considers 3D aspects of the human body and can accurately model garments across individuals, handling “large pose and viewpoint variations, while preserving garment textures and structures effectively.” The network Liang and her colleagues developed performed favorably compared to other existing networks on this task.
Liang anticipates sharing additional research on virtual try-on at this year’s Computer Vision and Pattern Recognition Conference (CVPR), the top event in the field of computer vision. The conference will be held in June in Vancouver, Canada, and Liang is serving as one of its four ombuds.
“In the future, we will have developed applications that can generate very realistic humans where you can edit them, replace the body and the face and edit the background,” she said. “If we are provided with images, we will be able to change any aspect and will be able to have the body speak and move in different ways.”
The potential applications are extensive. Retail is an obvious one, with virtual try-on, but the technology could also bring more realistic avatars to the metaverse, to film, and even to the virtual humans we engage with directly.