Today, for the most part, robots are limited to controlled environments, like assembly lines and warehouses. Sure, some people have robotic vacuum cleaners that buzz around their floors, sucking up dust and crumbs. But beyond that, it’s rare to see robots that can navigate dynamic, real-world environments.
Part of the reason is that it’s difficult to design machines that can work in spaces that are constantly changing. If robots are to be more useful in our day-to-day lives, they will need to be able to quickly analyze dynamic scenes and react to them.
Scientists from MBZUAI are working towards this goal by improving machines’ ability to recognize objects. Together with researchers from other institutions, a team from MBZUAI recently developed a new method that was shown to be more accurate and significantly faster than previous approaches in a task known as open-vocabulary 3D instance segmentation.
The research was led by Mohamed El Amine Boudjoghra, a graduate of the master’s program at MBZUAI and now a doctoral student at Technical University of Munich (TUM). Angela Dai of TUM and Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan and Fahad Shahbaz Khan of MBZUAI are co-authors on the study. The findings will be shared in an oral presentation at the 13th International Conference on Learning Representations, held at the end of April in Singapore.
Boudjoghra began his academic career working on problems related to robotic control but has more recently become interested in computer vision and 3D scene understanding. He sees robotic control and computer vision as affiliated disciplines, since innovations in both fields are necessary if scientists are to build increasingly capable and useful robots. “Computer vision is about getting information from an environment, while robotic control is about executing on that information,” he says. “In my opinion, you can’t build control systems without computer vision.”
Boudjoghra and his colleagues call their system Open-YOLO 3D, and it’s designed for open-vocabulary 3D instance segmentation, which enables it to detect and differentiate individual objects in a 3D scene without being limited to a predefined set of object categories. Take, for example, a theoretical robot that has been directed to rearrange chairs in a conference room. 3D instance segmentation makes it possible for the robot not only to recognize the chairs as chairs, but also to recognize each individual chair as a discrete and identifiable object.
To do this, machines need to collect information about the environment. They do this through sensors of different kinds. Cameras generate 2D images, while other technologies, like lidar (light detection and ranging), are used to generate 3D representations in a format known as a 3D point cloud. Both technologies have their benefits. Images from cameras contain information about objects — chairs, tables, windows — that systems can easily interpret and classify. 3D point clouds contain detailed information about a scene’s geometry, but don’t include data that can easily be used for classification. The key challenge in 3D instance segmentation is linking features from 2D images to the spatial information provided by the 3D point cloud.
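To make the difference concrete, here is a minimal sketch of how the two representations typically appear in code. The array shapes and sample coordinates are illustrative assumptions, not details from the study.

```python
import numpy as np

# A 2D RGB image: a dense H x W grid of color values that a 2D detector
# can classify directly (e.g. "this region is a chair").
image = np.zeros((480, 640, 3), dtype=np.uint8)

# A 3D point cloud: an unordered set of (x, y, z) coordinates in meters.
# It captures the scene's geometry precisely, but carries no class labels.
point_cloud = np.array([
    [1.20, 0.35, 0.80],   # a point on a chair's backrest
    [1.22, 0.33, 0.45],   # a point on the chair's seat
    [2.75, 1.10, 0.72],   # a point on a table top
])
```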
Open-YOLO 3D improves on a previous method called OpenMask3D, the first designed to perform 3D instance segmentation in a zero-shot manner, meaning it could work with objects it didn’t see during training. OpenMask3D was designed by a team from ETH Zurich, Microsoft and Google, and it relies on two key technologies: a system called SAM (segment anything model), developed by researchers at Meta to identify, or segment, objects in 2D images; and a vision-language system known as contrastive language-image pretraining, or CLIP, developed by OpenAI, which associates images with text. While OpenMask3D is innovative, it is slow, taking five to 10 minutes to analyze a scene.
Boudjoghra wanted to develop a system that performed as well as OpenMask3D but faster. “Five to 10 minutes is fine under some scenarios if the scene doesn’t change often,” Boudjoghra says. But in many situations, people move through spaces, rearranging objects and introducing new ones. “If a human comes and changes a scene, we will have to redo it. Instead of taking minutes, we want to take a few seconds.”
The basic idea of Open-YOLO 3D is to identify objects in 2D images and project information from the 3D point cloud onto those images. This creates a representation that contains both semantic information about the objects in a scene and their exact locations within it.
Instead of performing detailed pixel-by-pixel segmentation of objects like OpenMask3D, Open-YOLO 3D assigns labels to objects in 2D images using what the researchers call low-granularity label maps. These are constructed by overlaying bounding boxes from a 2D object detector onto each image and filling the pixels within each box with the predicted class label. The 3D points from the point cloud are then projected into these images, where they inherit the corresponding class labels from the low-granularity label maps. All of this information can be combined because the system can relate the exact position from which an image was taken to the 3D point cloud using what are known as the intrinsic and extrinsic parameters of the camera.
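The projection step can be sketched roughly as follows. This is a simplified illustration of the general idea rather than the authors’ implementation: the function name, the single-view setup, and the use of a pinhole camera model with NumPy arrays are all assumptions made for this example.

```python
import numpy as np

def project_points_to_labels(points, K, world_to_cam, label_map):
    """Assign class labels to 3D points by projecting them into one view.

    points       : (N, 3) point cloud in world coordinates
    K            : (3, 3) camera intrinsic matrix
    world_to_cam : (4, 4) camera extrinsic matrix (world -> camera)
    label_map    : (H, W) low-granularity label map, where each pixel holds
                   the class id of the bounding box covering it (-1 = none)

    Returns an (N,) array of class ids (-1 for points that fall outside the
    image or behind the camera).
    """
    H, W = label_map.shape
    n = points.shape[0]

    # Move points from world coordinates into camera coordinates.
    pts_h = np.hstack([points, np.ones((n, 1))])        # homogeneous (N, 4)
    cam = (world_to_cam @ pts_h.T).T[:, :3]             # (N, 3)

    # Keep only points in front of the camera.
    z = cam[:, 2]
    valid = z > 1e-6
    z_safe = np.where(valid, z, 1.0)                    # avoid divide-by-zero

    # Pinhole projection with the intrinsics: pixel = K * (X, Y, Z) / Z.
    proj = (K @ cam.T).T
    u = np.round(proj[:, 0] / z_safe).astype(int)
    v = np.round(proj[:, 1] / z_safe).astype(int)

    # Discard points that land outside the image bounds.
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Each visible point inherits the label of the pixel it projects onto.
    labels = np.full(n, -1, dtype=int)
    labels[valid] = label_map[v[valid], u[valid]]
    return labels
```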
In addition, instead of using CLIP, the researchers use a method called multi-view prompt distribution (MVPDist) to label features. MVPDist aggregates class labels across multiple images, ensuring that the final classification for each object is based on the most frequently occurring labels.
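A simplified, majority-vote reading of that aggregation step might look like the sketch below. The actual MVPDist formulation in the paper may differ; the function name and data layout here are assumptions made for illustration.

```python
from collections import Counter

def aggregate_instance_label(per_view_labels):
    """Pick a final class for one 3D instance from its per-view labels.

    per_view_labels : list of 1D arrays, one per camera view, holding the
                      class ids assigned to the instance's visible points
                      in that view (-1 = unlabeled / background).

    Builds a distribution of class ids across all views and returns the
    most frequent one, i.e. a simple majority vote over views.
    """
    counts = Counter()
    for labels in per_view_labels:
        counts.update(int(c) for c in labels if c >= 0)
    if not counts:
        return -1  # the instance was never covered by a labeled box
    return counts.most_common(1)[0][0]
```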
Boudjoghra and his colleagues compared Open-YOLO 3D with the current state-of-the-art system, called Open3DIS, on two standard benchmark datasets. The results show both high accuracy and a substantial increase in processing speed. Open-YOLO 3D achieved a mean average precision (mAP) of 24.7% on the ScanNet200 validation set, an absolute gain of 2.3% over Open3DIS. Perhaps more important, it was approximately 16 times faster, analyzing a scene in about 22 seconds, while other methods can take minutes.
“I expected it to be faster, but I was surprised that it was doing better in terms of accuracy results,” Boudjoghra says. The results illustrate Open-YOLO 3D’s effectiveness in providing fast and accurate 3D instance segmentation and are a step towards enhanced performance in real-world scenarios where both speed and precision are required. “Innovations like these are going to be good for dynamic scenes that change over time,” he adds.
Much progress needs to be made if we are to have robots that help us with chores and tasks in our homes and at work. This includes innovations in robotic control and computer vision, but developments in other fields need to happen as well, Boudjoghra explains. For example, today’s robots can’t be given a general task and be expected to make a plan to execute it. “I can’t tell a robot to clean up the apartment while I’m at work,” Boudjoghra says. Such a task requires not only navigation, scene understanding and object recognition, but also complex reasoning.
He believes, however, that the increasing reasoning capability of large language models will someday help machines work more independently without the need for specific instructions. “This kind of thing isn’t possible today, but in the future, with more innovations, they may be able to navigate and execute tasks that aren’t so clear.”