One of the goals of the MBZUAI Metaverse Center is to provide people with “superhuman” capabilities, allowing them to do things in virtual worlds that they can’t do in real life, said Hao Li, director of the Metaverse Center and associate professor of computer vision at MBZUAI.
But one challenge in making virtual worlds more appealing is developing technologies that allow users to generate realistic avatars of their choosing. Today there are virtual worlds in which users can create graphical avatars that are sufficient for some experiences, but no one would confuse what they see in these environments with reality.
Li and his colleagues recently developed a new technology for generating photorealistic 3D avatars for use in virtual environments, such as 3D video conferencing and augmented or virtual reality (AR/VR). Their approach was presented at the 2024 Conference on Computer Vision and Pattern Recognition, which was held this month in Seattle and is one of the largest conferences in computer vision. The project is a collaboration between ETH Zurich, VinAI Research, Pinscreen and MBZUAI.
Another dimension
Generating photorealistic avatars quickly and efficiently is a difficult task, and it’s one that scientists have been trying to address for years. “It’s extremely hard to create an avatar that actually looks like a human,” Li said.
The challenge essentially comes down to a mapping problem: a 2D image of the “source,” the person or character the user would like to emulate, must be transferred onto the “driver,” the user whose movements and expressions animate the 3D avatar in the virtual world.
What’s more, because virtual worlds are often 3D, the combined avatar needs to look good from any angle. “As opposed to video, it must be 3D, so there is the added complexity of poses, expressions and views,” Li said.
The task is perhaps best illustrated by an example. Say that in your daily life you spend most of your waking hours sitting at a desk, typing away at a computer, but when you have the opportunity to participate in a virtual world, you’d like to take on the likeness of the late globetrotting philosopher of food and culture, Anthony Bourdain. How can Bourdain’s face be melded with your own movements in another reality?
“Typically, the way people have tried to solve this problem is by building complicated models of how human faces appear and how people express themselves,” Li explained. “But it’s a lot of engineering, and usually the outcome still looks a little weird.”
Researchers at Meta, for example, have developed a method that uses more than 150 cameras to capture movements of a person’s face to aid in the creation of a digital avatar. It works rather well, Li said, but this approach is impractical and computationally intensive.
One of the great benefits of Li and his colleagues’ approach is that it only needs one photo of the source to create a realistic reenactment based on the movements of the driver.
“AI is part of the software stack to digitize and render the avatar in real time,” Li said. “The deep neural network is like a super complicated filter in that it takes in 2D information in the form of a photo and sends out 3D information. It’s a form of generative AI and we allow it to create an avatar from one photo by letting it use its imagination.”
They call their approach “VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot Head Reenactment.”
Li explained that the AI used in the new technology is responsible for a few important activities. First, it turns the 2D source image into a 3D representation. Second, it extracts what is known as the identity of the source image: the characteristics of a face that make it unique and identifiable across a variety of settings and contexts. Third, it extracts expressions from the driver, which are the facial positions, poses and movements that a person makes. Together, this process is known as disentanglement.
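To make those three roles concrete, here is a minimal sketch in PyTorch of a network that lifts two images into features and splits them into an identity code (from the source) and an expression code (from the driver). Every module name, layer size and pooling choice here is an illustrative assumption, not the actual VOODOO 3D implementation.

```python
import torch
import torch.nn as nn

class DisentanglementSketch(nn.Module):
    """Toy illustration of the disentanglement idea, not the paper's
    architecture. All sizes are assumptions."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # step 1: lift a 2D image into a feature representation
        # (the real system produces a volumetric, 3D representation)
        self.lift = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=8, stride=8),
            nn.ReLU(),
        )
        # step 2: identity head, i.e. who the avatar looks like
        self.identity_head = nn.Linear(feat_dim, feat_dim)
        # step 3: expression head, i.e. how the avatar moves
        self.expression_head = nn.Linear(feat_dim, feat_dim)

    def forward(self, source_img, driver_img):
        src = self.lift(source_img).flatten(2).mean(-1)  # (B, feat_dim)
        drv = self.lift(driver_img).flatten(2).mean(-1)  # (B, feat_dim)
        return self.identity_head(src), self.expression_head(drv)

# one source photo and one driver frame, both 3 x 256 x 256
model = DisentanglementSketch()
identity, expression = model(torch.randn(1, 3, 256, 256),
                             torch.randn(1, 3, 256, 256))
print(identity.shape, expression.shape)  # (1, 256) each
```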
Through this process, both the source and driver images are put into the same standardized pose, known as a canonical pose. “All the avatars in 3D are looking in one specific direction, which makes learning expressions much easier for the system because it reduces the dimensionality,” Li said.
This technique “treats identity and expression disentanglement independently from head pose,” Li explained. “We want to reduce dimensionality so that a model can learn to extract the features and regenerate the content effectively.”
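The canonicalization step can be pictured as undoing the head’s rigid pose before any identity or expression features are extracted. The sketch below inverts an estimated rotation and translation on a toy point cloud; the real system operates on volumetric features rather than explicit points, and the pose-estimation step is assumed away here.

```python
import math
import torch

def to_canonical(points, rotation, translation):
    """Map posed 3D head points back to the canonical (frontal) frame
    by inverting the rigid head pose: p_canonical = R^T (p - t).
    A toy stand-in for pose normalization, not the paper's method."""
    return (points - translation) @ rotation  # row vectors: v @ R == R^T v

# toy example: a head rotated 30 degrees about the vertical axis
theta = math.radians(30.0)
R = torch.tensor([[math.cos(theta), 0.0, math.sin(theta)],
                  [0.0, 1.0, 0.0],
                  [-math.sin(theta), 0.0, math.cos(theta)]])
t = torch.tensor([0.0, 0.0, 2.0])

canonical_points = torch.randn(100, 3)        # frontal-facing head
posed_points = canonical_points @ R.T + t     # head as observed
recovered = to_canonical(posed_points, R, t)  # frontal again
print(torch.allclose(recovered, canonical_points, atol=1e-5))  # True
```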
The result is that the source provides the way the avatar looks, while the driver provides the poses, movements and expressions. Several examples can be found on the website for the project.
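Once identity and expression codes exist, reenactment amounts to fusing them and rendering from a chosen viewpoint. Here is a hedged sketch of that final step; the decoder below, a small MLP producing a low-resolution image, is a deliberately simplified stand-in for the volumetric rendering used in the actual system.

```python
import torch
import torch.nn as nn

class ReenactmentDecoder(nn.Module):
    """Fuse the source's identity code with the driver's expression
    code and render from a chosen view direction. Illustrative only."""

    def __init__(self, feat_dim=256, img_size=64):
        super().__init__()
        self.img_size = img_size
        # identity + expression + 3D view direction -> RGB pixels
        self.render = nn.Sequential(
            nn.Linear(feat_dim * 2 + 3, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3 * img_size * img_size),
            nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, identity, expression, view_dir):
        code = torch.cat([identity, expression, view_dir], dim=-1)
        out = self.render(code)
        return out.view(-1, 3, self.img_size, self.img_size)

# the avatar looks like the source, moves like the driver, and can
# be rendered from any angle by varying view_dir
decoder = ReenactmentDecoder()
frame = decoder(torch.randn(1, 256), torch.randn(1, 256),
                torch.tensor([[0.0, 0.0, 1.0]]))  # frontal view
print(frame.shape)  # torch.Size([1, 3, 64, 64])
```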
How and why
The team used an architecture known as vision transformers to process the source and driver information. And while transformers have been shown over the past few years to be an extremely effective architecture for computer vision and natural language processing tasks, Li believes that the quality of the data that is used to train any given system is perhaps even more important than the architecture.
“There are a lot of studies in which people argue about one architecture being better than another. I think that’s not necessarily true. How you train a network is often overlooked but it makes a huge difference,” he said.
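For reference, the kind of vision transformer Li mentions can be sketched in a few lines: the image is cut into patches, each patch is embedded, and self-attention layers process the patch sequence. The dimensions below are illustrative assumptions, not those of the published model.

```python
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    """Minimal vision-transformer encoder: patchify, embed, attend.
    Sizes are illustrative assumptions."""

    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # a strided convolution is the standard patch-embedding trick
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img):
        x = self.patchify(img)            # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, n_patches, dim)
        x = x + self.pos_emb              # learned positional embeddings
        return self.encoder(x)            # per-patch features

encoder = TinyViTEncoder()
features = encoder(torch.randn(1, 3, 224, 224))
print(features.shape)  # torch.Size([1, 196, 192])
```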
Li and his team are already working on the next version of VOODOO, which could generate not only the face of an avatar but the whole body.
As with many powerful AI applications, there are risks. But Li says that making the data open source could help detection frameworks identify misuse.
Perhaps a deeper concern of Li’s relates to the impact that widespread use of avatars may have on individual psychology and society as a whole.
“One day people will be using these kinds of technologies. In a world in which people can be anyone they want, it’s important to think about what these products should look like and what some of the safeguards should be,” Li said.