In recent years, AI developers have built and refined what are known as foundation models: large machine-learning systems that are trained on enormous and broad datasets with the goal of giving them the ability to complete a variety of tasks. OpenAI’s GPT series and Google’s BERT are two examples of foundation models that are designed to process language. Meta’s Segment Anything Model (SAM) is, as its name suggests, a foundation model built to segment, or identify, objects in any type of image or video.
SAM is impressive in terms of its size and performance. The dataset that was used to train it is made up of 11 million images and more than a billion segmentation masks, which identify boundaries of objects. Meta describes the model as having a “general notion of what objects are”, and it can also identify objects it hasn’t been exposed to previously, an ability known as zero-shot generalization. Yet while SAM’s performance on everyday images is strong, it struggles with images from specialized fields, like medicine.
Scientists at the Mohamed bin Zayed University of Artificial Intelligence have recently developed an efficient method that takes advantage of the broad capabilities of SAM while significantly improving its performance on medical images. The research, led by Chao Qin, a Ph.D. student in computer vision at MBZUAI, was nominated for the Best Paper Award at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention, which took place in October in Marrakesh, Morocco.
Medical images, like those generated by CT (computed tomography) and MRI (magnetic resonance imaging) technologies, are quite different from the data SAM was trained with. “When we apply SAM to medical images, we find that there is a large gap between its performance with natural images and its performance with medical images,” Qin says.
Retraining SAM — or any foundation model — to improve its performance on a specialized subset of images would take time and be costly. It would also require a large dataset of medical images, which are scarcer than everyday images, Qin adds.
Another option for improving performance is a technique known as fine-tuning. Qin and his colleagues write in the study, however, that fine-tuning has limitations and would not "fully harness domain-specific (medical) knowledge." Earlier this year, a team of scientists from the University of Toronto, New York University and Yale University developed what they call MedSAM, which fine-tunes a component of SAM that plays a role in masking objects in images. MedSAM marked a significant improvement over the general SAM, boosting performance by more than 22% across a variety of segmentation tasks. That said, MedSAM still failed to correctly segment nearly 19% of the images it was presented with.
Qin and his colleagues turned to a different technique, known as an adapter, which has often been used in natural language processing but has yet to be taken up widely in computer vision. Instead of retraining an entire model, adapters make it possible to train a small additional component for a specific task: the parameters of the base model are 'frozen' while the adapter layers are trained. The approach is efficient because only a small fraction of the overall model is updated.
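To give a rough sense of the idea, the sketch below shows how an adapter might be attached to a frozen pretrained model in PyTorch. The module names, bottleneck size and structure here are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the adapter idea (illustrative, not the authors' code):
# the pretrained encoder is frozen, and only a small bottleneck adapter
# placed on top of its features is trained on the new (medical) data.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module: down-project, non-linearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual connection keeps the frozen model's features intact
        return x + self.up(self.act(self.down(x)))

def attach_adapter(frozen_encoder: nn.Module, feature_dim: int) -> Adapter:
    # Freeze every parameter of the large pretrained model
    for p in frozen_encoder.parameters():
        p.requires_grad = False
    # Only the adapter's (much smaller) parameter count will be updated
    return Adapter(feature_dim)
```

During training, the optimizer is given only the adapter's parameters, so the foundation model itself is never modified.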
Qin and his colleagues call their innovation DB-SAM, for dual-branch SAM, as it combines one branch that contains a vision transformer (ViT) with another that contains a convolutional neural network (CNN).
ViTs and CNNs each have their own strengths. ViTs excel at making sense of global information in images, while CNNs perform well when processing local information. The information from the two branches is fused together at the end of the process. "It's not trivial to fuse information from a transformer and a CNN," Qin explains. "We designed a bilateral cross-attention block to fuse the two features."
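The sketch below illustrates what bidirectional cross-attention between two feature streams can look like in PyTorch. The dimensions, token counts and module names are assumptions made for illustration; they do not reproduce the paper's exact bilateral cross-attention block.

```python
# Illustrative sketch of fusing ViT and CNN features with cross-attention in
# both directions (assumes both branches produce the same number of tokens
# with the same embedding dimension; not the paper's exact design).
import torch
import torch.nn as nn

class BilateralCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # ViT tokens attend to CNN features, and vice versa
        self.vit_to_cnn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cnn_to_vit = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vit_tokens, cnn_tokens):
        # vit_tokens, cnn_tokens: (batch, num_tokens, dim)
        vit_enriched, _ = self.vit_to_cnn(vit_tokens, cnn_tokens, cnn_tokens)
        cnn_enriched, _ = self.cnn_to_vit(cnn_tokens, vit_tokens, vit_tokens)
        # Concatenate the two enriched streams and project back to one feature
        return self.fuse(torch.cat([vit_enriched, cnn_enriched], dim=-1))

# Example usage with random features standing in for the two branches
fused = BilateralCrossAttention(dim=256)(torch.randn(1, 196, 256),
                                         torch.randn(1, 196, 256))
```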
Qin and his colleagues evaluated DB-SAM on a large body of 2D and 3D images that was compiled from 30 public medical datasets. DB-SAM outperformed both SAM and MedSAM by a significant margin. According to two metrics known as dice similarity coefficient (DSC) and normalized surface distance (NSD), DB-SAM showed an increase in performance on 3D segmentation tasks of 6% and nearly 8%, respectively, compared to MedSAM. On 2D segmentation, DB-SAM was better by 4.8% (DSC) and 8.6% (NSD).
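For reference, DSC is a standard overlap measure between a predicted segmentation mask and the ground-truth mask, defined as twice the size of their intersection divided by the sum of their sizes. The short function below is an illustrative implementation; NSD, which compares how closely the surfaces of the two masks agree, requires distance transforms and is not shown.

```python
# Illustrative Dice similarity coefficient: DSC = 2|A ∩ B| / (|A| + |B|)
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Compute DSC between two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps avoids division by zero when both masks are empty
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```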
“When I started this research, I never imagined that we could achieve such good performance,” Qin says. “But of course there were also failure cases and it’s interesting to study these because they can help us design more sophisticated structures to improve the model in the future.”
Qin envisions a future in which foundation models will play a bigger role in specialty fields like medicine. “There are lots of models that have been designed for one purpose, say for one organ, or one disease. But we want to design a foundation model that can segment all organs of the human body,” he says.
It’s beneficial for hospitals to have one model for use across clinical applications, as a single large model is likely more economical than several, more specialized models. Foundation models also have the advantage that they improve by being exposed to different kinds of data, Qin explains. Images of different organs contain complementary information about disease and how images should be segmented.
For Qin, perhaps the most important insight from this work relates to his and his colleagues’ design philosophy and the ease with which adapters can be applied to foundation models today and tomorrow. “There will always be better foundation models, and we can design a new adapter to be used with them,” he says.