Vision and insight: Charting the course of embodied AI with Ian Reid

Thursday, May 02, 2024

Ian Reid, professor of computer vision and chair of the computer vision department at Mohamed Bin Zayed University of Artificial Intelligence, has had a career in artificial intelligence that has spanned decades and continents. He has held appointments at the University of Oxford, the University of Adelaide and now at MBZUAI. He has witnessed, and indeed driven, some of the most important developments in the field of computer vision over the past 30 or so years.

Throughout his career, Reid has been interested in developing machines that can deliver on the promise of what is known today as embodied AI, which refers to systems that interact with the physical world in the form of robots or other machines and can learn and adapt to the environment around them. “I see embodied AI as distinct from the kind of intelligence that we see emerging from things like ChatGPT,” Reid said. “Embodied AI is the next level of intelligence that requires an understanding of the visual world and the ability to interact with the physical world in an intelligent way.”

Past is prologue

Though embodied AI has a long history as an area of study, it has evolved significantly over the years.

Early in Reid’s career, embodied AI was known as active vision. In his group at Oxford, he developed ways for robots to track objects in the environment. “At the time, this kind of work was sort of an end in itself,” he said. “We wanted to be able to control a robot to track an object, or to move through the environment and understand the geometry of the scene.”

Since that time, the evolution of embodied AI has been shaped by three significant developments in the broader computer vision community, which have transformed how scientists build machines to perceive and respond to the physical world, Reid explained.

The first breakthrough took place in the late 1990s and early 2000s, with advances in multiple-view geometry that made it possible to use cameras as geometric sensors, providing information about the spatial relationships between objects in a scene. This is a significant challenge: a single 2D image discards depth, so spatial structure must be reconstructed by combining measurements from multiple views.
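To make the challenge concrete, below is a minimal sketch of two-view triangulation, the textbook way a pair of calibrated cameras recovers a 3D point from its 2D projections. The camera matrices and point are made-up numbers for illustration, not drawn from Reid's work.

```python
# Triangulate a 3D point from two views using the direct linear transform (DLT).
# Illustrative values only: a real pipeline must also match image features and
# estimate the camera matrices themselves.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Least-squares 3D point from two 2D projections and camera matrices."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # solution is the last right singular vector
    X = Vt[-1]
    return X[:3] / X[3]              # dehomogenize

K = np.array([[500.0, 0, 320], [0, 500, 240], [0, 0, 1]])      # shared intrinsics
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])              # camera at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])  # 1 unit to the right

X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1)                                 # project into each view
x2 = P2 @ np.append(X_true, 1)
print(triangulate(P1, P2, x1[:2] / x1[2], x2[:2] / x2[2]))     # recovers [0.5 0.2 4.0]
```

With perfect correspondences the reconstruction is exact; the hard part in practice is finding those correspondences and the camera geometry in the first place.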

Reid and his colleagues at Oxford led much of this research. During his doctoral studies, Reid worked with Sir Michael Brady, currently an emeritus professor at Oxford and adjunct distinguished professor at MBZUAI. After receiving his doctorate, Reid stayed on at Oxford to teach there for more than two decades.

Reid and his colleagues’ work at Oxford laid the foundation for an innovation known as visual SLAM (simultaneous localization and mapping), which enables cameras mounted on robots to act as geometric sensors, helping the robots navigate their surroundings and move through three-dimensional space. Robot navigation is a complex problem, as the robot must sense the physical characteristics of the environment and its own changing position within that environment as it moves. The idea of visual SLAM is that “in real time, as you move a camera through an environment, you’re getting enough information to be able to build a map of the environment and localize the camera with respect to that map,” Reid said.
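To give a flavor of that loop, here is a heavily simplified, translation-only toy in Python: the robot alternates between localizing itself against landmarks it has already mapped and adding newly observed landmarks to the map. Real visual SLAM estimates full six-degree-of-freedom camera poses from image features; the sensing model and smoothing factor below are illustrative assumptions.

```python
# Toy SLAM loop: localize against the current map, then extend the map.
import numpy as np

rng = np.random.default_rng(0)
true_landmarks = rng.uniform(-10, 10, size=(8, 2))  # ground truth, unknown to the robot
map_est = {}                                        # landmark id -> estimated position
pose_est = np.zeros(2)                              # the first pose defines the map origin
true_pose = np.zeros(2)

def sense(pose):
    """Noisy relative offsets to every landmark within sensing range."""
    return [(i, lm - pose + rng.normal(0, 0.05, 2))
            for i, lm in enumerate(true_landmarks)
            if np.linalg.norm(lm - pose) < 12.0]

for i, offset in sense(true_pose):   # landmarks seen from the first pose anchor the map
    map_est[i] = pose_est + offset

for step in range(20):
    true_pose = np.clip(true_pose + rng.uniform(-1, 1, 2), -5, 5)  # robot wanders
    obs = sense(true_pose)
    # Localize: every already-mapped landmark votes for the pose; average the votes.
    votes = [map_est[i] - offset for i, offset in obs if i in map_est]
    if votes:
        pose_est = np.mean(votes, axis=0)
    # Map: initialize unseen landmarks; nudge re-observed ones toward the new estimate.
    for i, offset in obs:
        guess = pose_est + offset
        map_est[i] = guess if i not in map_est else 0.9 * map_est[i] + 0.1 * guess

print("final pose error:", np.linalg.norm(pose_est - true_pose))
```

The point of the toy is the chicken-and-egg structure Reid describes: the map is only as good as the pose estimates, and the pose estimates are only as good as the map, so the two must be refined together.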

Reid’s studies on visual SLAM have been hugely influential; he was a co-author of MonoSLAM, one of the first systems to demonstrate real-time SLAM with a single camera.

Enter the machines

As the field of computer vision matured further, researchers accomplished another significant advancement by integrating insights from machine learning into computer vision, Reid explained. This period saw the realization that machine learning algorithms — such as random forests, AdaBoost, and support vector machines — could outperform traditional, rule-based approaches in recognizing patterns and objects within images. “People at that time came to realize that it was much easier, more effective and faster to train a generic machine learning algorithm to recognize patterns in images than it was to code a bunch of rules to recognize patterns in images,” Reid said.
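A minimal sketch of that shift, using scikit-learn’s built-in digits dataset (the library and parameter choices here are illustrative, not drawn from Reid’s work): no hand-coded rules about strokes or loops, just a generic support vector machine trained on labeled examples.

```python
# Train a generic SVM to recognize 8x8 digit images instead of hand-coding rules.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                 # 1,797 labeled 8x8 images
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", gamma=0.001)                # generic learner
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```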

These innovations were not limited to computer vision: machine learning was reshaping natural language processing in parallel, highlighting its versatility and transformative power across different domains of AI.

The third, and perhaps most revolutionary, advancement came in 2012 with the reemergence of deep learning, particularly through the application of convolutional neural networks (CNNs), an architecture pioneered by Yann LeCun of New York University and Meta.

Though the concept of CNNs had been around for decades, the first major demonstration of their power was spearheaded by Geoffrey Hinton and his students Alex Krizhevsky and Ilya Sutskever at the University of Toronto. In 2012 they applied the approach in the popular ImageNet competition, which was built on a large dataset compiled by Fei-Fei Li of Stanford University and others.

The University of Toronto team called their approach AlexNet, and it delivered a dramatic improvement in accuracy over the previous year’s winner, which relied on support vector machines. (The impact of the 2012 ImageNet competition has been recounted in depth in several publications, including the MIT Technology Review and Melanie Mitchell’s “Artificial Intelligence: A Guide for Thinking Humans,” among others.)

The use of CNNs was complemented by other developments taking place at the time, including the creation of huge, labeled datasets — like ImageNet — and the ever-increasing computational power of graphics processing units, or GPUs, which made the use of CNNs more practical.
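For a sense of what a CNN looks like in code, here is a tiny PyTorch sketch of the structure AlexNet scaled up: stacked convolution and pooling layers that learn image features, followed by a classifier head. It is far smaller than AlexNet and purely illustrative.

```python
# A tiny convolutional network: learned filters plus downsampling, then a classifier.
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local image filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(start_dim=1))

logits = TinyConvNet()(torch.randn(1, 3, 32, 32))  # one 32x32 RGB image
print(logits.shape)                                # torch.Size([1, 10])
```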

“The ImageNet competition was the point where computer vision and machine learning really got together and we realized that the field was being revolutionized and would never be the same again,” Reid said. “Some of us working in robotics realized machine learning could have a huge impact in robotic vision as well. Robotic vision involves putting camera sensors on robots and is characterized by requirements like real-time operation and the need to deal with an open world rather than a closed dataset.”

Today and tomorrow

While machine learning has had an undeniable influence on computer vision, Reid is skeptical of its ability to solve every problem. He is also concerned that knowledge from a variety of domains has often been set aside in favor of machine learning-based solutions. “I’m interested to take things that we already know about the world and to use learning to make machines better at solving problems,” Reid said, rather than simply using a brute-force approach that leverages huge amounts of data.

For example, scientists have made significant advances in domains such as physics and robotic control theory over the years, Reid explained. Those insights can be applied to problems in artificial intelligence and augmented by the power of machine learning. “Let’s use machine learning for what it’s really good at, which is as an adjunct to that stuff that we already know, rather than as a replacement for what we already know,” Reid said.
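One way to read that idea in code is residual learning: keep the known physical model and train a small network only on what it fails to capture, rather than learning the whole system from scratch. The pendulum model and unmodeled drag term below are illustrative assumptions, not Reid’s method.

```python
# Learn only the residual between a known physics model and the "real" system.
import torch
import torch.nn as nn

def known_physics(theta, omega, g=9.81, length=1.0):
    """Textbook frictionless pendulum: angular acceleration = -(g/L) sin(theta)."""
    return -(g / length) * torch.sin(theta)

def true_system(theta, omega):
    """The 'real' system: the physics plus an unmodeled drag term."""
    return known_physics(theta, omega) - 0.3 * omega

residual_net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(residual_net.parameters(), lr=1e-2)

for step in range(2000):
    theta = torch.rand(256, 1) * 6.28 - 3.14            # random pendulum states
    omega = torch.rand(256, 1) * 4.0 - 2.0
    target = true_system(theta, omega) - known_physics(theta, omega)
    pred = residual_net(torch.cat([theta, omega], dim=1))
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"residual fit loss: {loss.item():.5f}")  # small: only the drag had to be learned
```

Because the network only has to account for the drag term, it needs far less data than one asked to rediscover the pendulum dynamics as well.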

It is an exciting time to be working in artificial intelligence. Unlike in fields such as nuclear fusion or quantum computing, where the promise always seems to be 15 years away, Reid believes AI is having a significant influence on many of our lives today. That said, the desire for innovation must be balanced against the impact these new technologies will have on people.

“I think as professors working in universities, we have the duty to ensure that the work that we’re doing in the development of artificial intelligence is sound and ethically sensible,” he said. “We have the responsibility to educate the public about the good and legitimate uses of artificial intelligence and the ways it is going to make people’s lives better.”
