At a pivotal moment in the film “The Matrix,” the dystopian story of the subjugation of humans by highly intelligent and all-powerful technology, the leader of the human resistance, Morpheus, tells his student Neo that “there’s a difference between knowing the path and walking the path.”
Morpheus doesn’t elaborate on the meaning of this statement. But one interpretation is that there is a fundamental difference between thinking and doing, contemplation and action.
In the realm of artificial intelligence, Morpheus’ words could also be used to illustrate the gap that exists between data-driven artificial intelligence systems — like large language models and computer vision programs — and embodied agents, or robots, that inhabit the real world and must interpret and respond to all its messiness and complexity.
Ivan Laptev, professor of computer vision at Mohamed Bin Zayed University of Artificial Intelligence, seeks to bridge this gap.
Challenges of embodiment
Watching “The Matrix” was formative for Laptev, and he was intrigued by the questions the film poses about how we perceive and interpret the world around us. He began his career in computer science, training machines to recognize objects in images. Later he focused on action recognition, using supervised learning to teach programs to identify human actions in video. Today, he is working to transfer innovations developed in computer vision to robotics.
There are significant challenges in doing so. One major hurdle relates to data. “In robotics, there is much less data than we are used to in computer vision,” Laptev said. “And data is key to many recent advancements in training machines to interpret images and language.” These include the neural networks that power image recognition applications and large language models trained on huge data sets.
One reason data is sparse in robotics today is that “there are very few robots in the world and they don’t put videos and images on the internet,” Laptev said. “The robots that do exist are in labs where all the scenarios are limited and boring,” compared to the real world.
Another difference between embodied agents and current computer vision systems is that robots generate new data by performing actions and changing the state of the world they are in, a loop that repeats endlessly.
“When a robot is planning, there is no way to explicitly account for every possible sequence of actions that it could take,” Laptev said. “Robotics is a realm in which there is little data and lots of possibilities. We have to figure out a way to do something about this.”
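A rough back-of-the-envelope calculation, with invented numbers rather than anything from Laptev’s work, shows why simply enumerating action sequences quickly becomes hopeless:

```python
# Illustrative only: even a modest action vocabulary explodes combinatorially.
actions_per_step = 10   # hypothetical number of choices at each step
horizon = 20            # hypothetical planning horizon
print(actions_per_step ** horizon)  # 100000000000000000000 possible sequences
```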
Robotics also poses the challenge of operating in real time. On one hand, robots must respond to changes in their environment as they happen, so they must constantly recalibrate their state and their perception of the world. On the other hand, operating in real time is a challenge for training, because real-world training is dictated by the speed of reality: it would take years to collect enough real data to train a robot.
To advance the field of robotics to a point where it can benefit people more, scientists will need to build systems that accurately simulate reality so that robots can learn and plan in these simulations before striking out into the world.
Thinking ahead
“We as humans have some simulation in our head which tells us in some more or less certain way what will happen in the world around us if we take a certain action,” Laptev said. “These kinds of simulators don’t currently exist for machines.”
Laptev describes this technology as a “world model” or a “world simulator” and it’s one of the central goals of his current work. “The simulator would provide robots with the ability to plan actions on a variety of time scales,” Laptev said. For example, the simulator could be used to help a robot plan to shop at the local supermarket. It could also be used to plan for more complex and conceptual tasks, like moving house to a city on a different continent six months from now.
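To make the idea concrete, here is a minimal sketch of how such a simulator could be used for planning. It follows a generic “random-shooting” scheme: sample candidate action sequences, roll each one forward through the world model, and keep whichever sequence the model scores best. The `world_model` and `score` functions are hypothetical placeholders, not components of Laptev’s system.

```python
import random

def plan_with_world_model(world_model, score, current_state,
                          horizon=10, num_candidates=100, num_actions=4):
    """Random-shooting planner: roll candidate action sequences through a
    learned world model and return the sequence with the best predicted score.

    `world_model(state, action)` and `score(state)` are hypothetical stand-ins
    for a learned simulator and a task objective."""
    best_plan, best_score = None, float("-inf")
    for _ in range(num_candidates):
        # Sample a random sequence of discrete actions over the horizon.
        actions = [random.randrange(num_actions) for _ in range(horizon)]
        state = current_state
        # Imagine the rollout entirely inside the model -- no real-world steps.
        for action in actions:
            state = world_model(state, action)
        candidate_score = score(state)
        if candidate_score > best_score:
            best_plan, best_score = actions, candidate_score
    return best_plan
```

Because every rollout happens inside the model, the robot can consider thousands of possible futures without taking a single real-world step.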
Over the years, there have been advancements in creating simulations of the world. Many video games today employ what are known as physics simulators that recreate the way objects move in a visually realistic way. These systems can mimic phenomena like friction or how an object falls off a table and onto the floor.
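At their core, these game-style physics simulators integrate simple equations of motion step by step. The toy sketch below, with made-up constants rather than values from any real engine, slides a block across a table with friction and lets gravity take over once it passes the edge:

```python
# Toy point-mass simulator: a block slides along a table with friction,
# then falls under gravity once it passes the table's edge.
# All constants are illustrative, not taken from any real physics engine.

G = 9.81            # gravitational acceleration, m/s^2
MU = 0.3            # coefficient of kinetic friction
DT = 0.01           # integration time step, s
TABLE_EDGE = 1.0    # x-coordinate of the table edge, m
TABLE_HEIGHT = 0.8  # table height above the floor, m

x, y = 0.0, TABLE_HEIGHT  # block position
vx, vy = 3.0, 0.0         # initial horizontal push, m/s

while y > 0.0:
    if x < TABLE_EDGE:
        # On the table: friction decelerates the slide.
        vx = max(0.0, vx - MU * G * DT)
        if vx == 0.0:
            break  # friction stopped the block before it reached the edge
    else:
        # Past the edge: free fall under gravity.
        vy -= G * DT
        y += vy * DT
    x += vx * DT

print(f"block comes to rest at x = {x:.2f} m, y = {max(y, 0.0):.2f} m")
```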
However, there are significant limitations to today’s physics simulators, and models trained in them don’t transfer well to the real world. “I see the opportunity to build something much more certain, something that is grounded on lots of information which will allow us to make a simulation that is more accurate to the way things behave in the world,” Laptev said.
Laptev believes that innovations in computer vision can provide the foundation for these simulators. “The big hope is to learn from data like videos on the internet, for example, which show lots of interactions between people and the world,” he said. “We may be able to capture this interaction data and distil it into a model which can predict what will happen if an action is taken in a scene.”
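In its simplest form, such a predictive model is a network trained on (state, action, next state) triples extracted from interaction data. The sketch below, with hypothetical dimensions and random tensors standing in for a real dataset, shows the basic supervised setup:

```python
import torch
import torch.nn as nn

# Minimal sketch of an action-conditioned dynamics model: given the current
# state of a scene and an action, predict the next state. Dimensions and the
# data are hypothetical placeholders, not from any particular system.

STATE_DIM, ACTION_DIM = 32, 8

model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, STATE_DIM),  # predicted next state
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(state, action, next_state):
    """One supervised update on a batch of (state, action, next_state)
    triples extracted from interaction data, e.g. videos of people acting."""
    prediction = model(torch.cat([state, action], dim=-1))
    loss = loss_fn(prediction, next_state)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random tensors stand in for a real interaction dataset.
batch = 64
state = torch.randn(batch, STATE_DIM)
action = torch.randn(batch, ACTION_DIM)
next_state = torch.randn(batch, STATE_DIM)
print(train_step(state, action, next_state))
```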
Yet this data-driven approach isn’t quite sufficient today. “People and animals learn intuitive physics at an early age by taking actions in the world and observing how the world reacts,” he said. “This biological learning, however, happens from very few samples and we do not yet know how to do the same with AI.”
A way to supplement the data-driven approach is to leverage classical physics models as guidance to help embodied agents learn more efficiently from data. “How to marry classical physics with machine learning is an open challenge,” Laptev said.
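One common way to attempt this marriage, shown here as a general illustration rather than Laptev’s specific approach, is residual learning: an analytic physics step makes a first prediction, and a small network is trained to predict only the correction that the physics misses.

```python
import torch
import torch.nn as nn

# Sketch of a "physics + learned residual" dynamics model. The dimensions,
# constants, and the crude physics prior are all illustrative assumptions.

STATE_DIM, ACTION_DIM = 6, 3   # position + velocity in 3D; action = acceleration
DT, G = 0.01, 9.81

def physics_step(state, action):
    """Crude analytic prior: integrate velocity and gravity for one time step.
    state = [position (3), velocity (3)], action = applied acceleration (3)."""
    pos, vel = state[..., :3], state[..., 3:]
    gravity = torch.tensor([0.0, 0.0, -G])
    new_vel = vel + (action + gravity) * DT
    new_pos = pos + new_vel * DT
    return torch.cat([new_pos, new_vel], dim=-1)

residual_net = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, STATE_DIM),
)

def predict_next_state(state, action):
    # Physics supplies the bulk of the answer; the network learns only what
    # it misses (friction, contact, deformation, ...).
    prior = physics_step(state, action)
    correction = residual_net(torch.cat([state, action], dim=-1))
    return prior + correction
```

Because the network only has to learn the discrepancy between the analytic prior and observed behavior, it can, in principle, get by with far less data than a model learning the full dynamics from scratch.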
It is a research question that needs to be addressed if robots that can act in the real world are to become more prevalent and useful to people.
In Morpheus’ words, a world simulator like the one Laptev envisions may allow embodied agents to leap from simply knowing the path to walking it.