Over the past several years, scientists have designed computer vision models that can perform many different tasks. These systems have been used to detect disease in medical images, aid in navigation, and identify changes in the environment.
Since these systems are powered by machine learning, they must be trained by processing huge amounts of data. Once trained, they are evaluated by testing their performance on datasets that weren’t used in training. And over the past decade, there has been one dataset, called ImageNet, that has served as a standard for evaluating the performance of computer vision models.
In a recent study, however, researchers at Meta AI Research and the Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) have found that the performance of computer vision models on ImageNet is not always an accurate indicator of how the models will perform on specific tasks.
In their study, the researchers consider four popular configurations of computer vision models and show that while the overall accuracy of the different configurations on ImageNet data may be similar, the behaviors of the models on certain types of images vary. Their findings indicate that one configuration of a model may be better suited to a particular task than another.
The research is being presented at the International Conference on Machine Learning (ICML 2024), which is being held this month in Vienna. Researchers from MBZUAI are authors on 25 studies that will be shared at the conference, which is one of the largest and most significant annual meetings in the field of machine learning.
“We provided a detailed analysis of different architectures and training paradigms across different behaviors,” said Zhiqiang Shen, assistant professor of machine learning at MBZUAI. “We found that the single metric of ImageNet does not fully capture performance nuances that are critical for specialized tasks.”
Standardized testing
One benefit of testing the performance of a model on ImageNet is that it is a large and diverse dataset that provides information about how a model will perform across a wide variety of images. Another benefit is that ImageNet serves as a benchmark — the performance of one model can be compared to the performance of other models that have also been tested on the dataset.
But the overall performance of a model on a benchmark like ImageNet only says so much. “A higher performance on ImageNet does not mean that a model will be the best on a specific task,” Shen said. “A model could have lower performance on ImageNet but be better for a particular task than another model that had higher performance on ImageNet.”
In addition, as the researchers write in the study, “simply identifying mistaken object classes,” which is the basic task with ImageNet, “might not offer actionable insights for model improvement. The key aspect, therefore, is finding the specific reasons for these mistakes.” The authors propose that by identifying the kinds of mistakes a system makes, researchers can find ways to retrain and improve it.
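As a rough illustration of what looking beyond a single accuracy number can mean in practice, the sketch below tallies a model's most frequent class confusions on a validation set rather than reporting only top-1 accuracy. It assumes a standard PyTorch image classifier; `model`, `val_loader`, and `class_names` are placeholders, not anything taken from the paper.

```python
# Hypothetical sketch: break top-1 errors down by (true, predicted) class
# pair instead of reporting a single accuracy number.
from collections import Counter

import torch


@torch.no_grad()
def error_breakdown(model, val_loader, class_names, device="cuda"):
    model.eval().to(device)
    confusions = Counter()  # (true class, predicted class) -> count
    correct, total = 0, 0
    for images, labels in val_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
        for t, p in zip(labels.tolist(), preds.tolist()):
            if t != p:
                confusions[(class_names[t], class_names[p])] += 1

    print(f"top-1 accuracy: {correct / total:.3f}")
    # The most frequent confusions hint at *why* the model fails,
    # e.g. texture-driven versus shape-driven mistakes.
    for (true_c, pred_c), n in confusions.most_common(10):
        print(f"{true_c} -> {pred_c}: {n}")
```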
In these experiments, the authors analyzed two different model architectures: convolutional neural networks, known as ConvNets, and transformers. They also looked at two training paradigms, supervised learning and contrastive language-image pretraining, or CLIP. This provided them with a total of four configurations to analyze.
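One way to picture that two-by-two grid in code is to pair off-the-shelf supervised checkpoints from torchvision with CLIP-pretrained encoders from open_clip. The specific checkpoints named here (ResNet-50, ViT-B/16 and their CLIP counterparts) are illustrative assumptions, not necessarily the models evaluated in the study.

```python
# Illustrative only: a ConvNet/transformer x supervised/CLIP grid built
# from publicly available checkpoints (assumed choices, not the paper's).
import open_clip
import torchvision.models as tvm

configs = {
    # Supervised ImageNet classifiers
    ("convnet", "supervised"): tvm.resnet50(weights=tvm.ResNet50_Weights.IMAGENET1K_V2),
    ("transformer", "supervised"): tvm.vit_b_16(weights=tvm.ViT_B_16_Weights.IMAGENET1K_V1),
    # CLIP-pretrained image encoders (create_model_and_transforms returns
    # the model plus preprocessing transforms; we keep only the model)
    ("convnet", "clip"): open_clip.create_model_and_transforms("RN50", pretrained="openai")[0],
    ("transformer", "clip"): open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")[0],
}

for (arch, paradigm), model in configs.items():
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{arch:12s} {paradigm:10s} {n_params / 1e6:.1f}M parameters")
```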
The four models analyzed in the study are a kind of battle of old versus new. ConvNets have been used for decades, while transformers were first proposed by a team from Google in 2017. Supervised learning is a mature technique, while CLIP was pioneered by OpenAI more recently, in 2021. “One of our motivations is that there is a debate about what is better when it comes to architectures, ConvNets or transformers,” Shen said. “But for one particular scenario, a transformer would be better, while for another scenario the ConvNet would be better. People need to choose the right architecture and training method for their task.”
Fit for purpose
Regarding model architecture, the researchers found that ConvNets made more texture-related mistakes than transformers did, but performed better than transformers on synthetic data.
They also found that supervised models are more robust than CLIP models, meaning that they perform better on a wide variety of data. That said, CLIP models process “abstract or creative visuals” better than supervised models.
All the models struggled to categorize images in which the main object was partially hidden, or occluded, as this characteristic is described in the field.
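A crude way to probe this kind of failure (not the authors' protocol) is to mask out a patch of each image and compare accuracy with and without the occlusion, as in the hypothetical sketch below.

```python
# Rough occlusion probe: zero out a random square patch in each image
# and measure how much accuracy drops. Assumes NCHW image batches.
import torch


def occlude(images, patch_frac=0.3):
    """Zero out a randomly placed square patch covering ~patch_frac of each side."""
    _, _, h, w = images.shape
    ph, pw = int(h * patch_frac), int(w * patch_frac)
    out = images.clone()
    for img in out:
        top = torch.randint(0, h - ph + 1, (1,)).item()
        left = torch.randint(0, w - pw + 1, (1,)).item()
        img[:, top:top + ph, left:left + pw] = 0.0
    return out


@torch.no_grad()
def accuracy(model, loader, device="cuda", occluded=False):
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        if occluded:
            images = occlude(images)
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Usage idea: compare accuracy(model, val_loader) against
# accuracy(model, val_loader, occluded=True) for each configuration.
```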
There have been previous studies on this topic, Shen noted, but they did not compare model configurations in a unified setting.
At the most fundamental level, the goal of the research is to find ways to make machine learning models more efficient and more accurate, Shen said. “The first step, of course, is that you need to choose the correct architecture and then you can determine ways to make the model more efficient for your task.”
Shen also noted that while large, well-funded tech companies developing computer vision models can improve performance simply by making models bigger and training them on ever more data, smaller companies and researchers with limited resources would benefit from choosing the right configuration from the start.
“Large companies can just train the model and get the desired performance,” Shen said. “But if you have a very efficient configuration, maybe you don’t need to spend so much on training.”