A multimodal approach for developing medical diagnoses with AI

Thursday, November 14, 2024
A doctor examines a digital copy of a chest X-ray.

Mai A. Shaaban, a doctoral student in machine learning at the Mohamed bin Zayed University of Artificial Intelligence, is developing AI tools that help physicians quickly and accurately diagnose disease.

“I’m aiming to speed up the diagnosis process so that models can provide physicians with support for decisions they make about patients,” she says, noting that faster and more accurate diagnoses could significantly improve patient outcomes by enabling quicker access to care.

Shaaban, along with colleagues at MBZUAI and Carleton University in Canada, has developed a first-of-its-kind system that is designed to analyze chest X-rays and other patient data to aid physicians in diagnosing lung diseases and injuries. Their innovation was recently presented at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), which was held in Marrakesh, Morocco.

Multimodal data analysis

When diagnosing a patient, doctors analyze different kinds of information, including electronic health records (EHR), which contain lab results and written notes, and medical images like X-rays. Trained physicians can make sense of this so-called multimodal data, but it is a significant challenge for machines. “The future of AI in health care will be based on multimodality and will require systems to have the capability to process EHR, images and other kinds of data,” Shaaban says. “We need to develop ways to integrate their analysis.”

Shaaban and her colleagues call their system MedPromptX. It uses multimodal large language models and techniques called visual grounding and few-shot prompting to make diagnoses based on EHR and X-ray images.

Shaaban and her colleagues created a dataset of historical patient records drawn from two large datasets, MIMIC-IV and MIMIC-CXR, which were developed at a hospital in Boston. The team’s new dataset, named MedPromptX-VQA, comprises more than 6,000 patient records, each paired with a chest X-ray. Each patient record is labeled with a medical condition, such as pneumonia or edema.

While electronic health records like those included in the MedPromptX-VQA dataset include valuable information, they are almost never complete, which causes problems for AI systems. For example, lab results are typically accompanied by contextual information that indicates what can be considered a normal range for a specific test. But this data isn’t always included in EHR, making it difficult for machine-learning systems to interpret results, Shaaban explains.

She and her colleagues turn to language models to help fill in this missing information. “We need to provide knowledge to models about normal and abnormal ranges, and this information exists in large language models because they are already trained on large amounts of medical data and have likely found this information in medical textbooks,” she says.

Improving performance

The researchers give X-rays and corresponding EHR from the MedPromptX-VQA dataset to their model. They employ a technique called visual grounding to help the system focus on relevant sections of X-ray images.

EHR are translated from what’s known as structured, or tabular, data into text. The model then converts the subsections of images selected through visual grounding, along with the text version of the EHR, into a format called an 'embedding' that can be interpreted by a machine-learning model. These embeddings provide background information to the system about relationships between X-rays, EHR and diagnoses. When the model is asked about a new patient, it draws upon similar historical cases through a process Shaaban and her colleagues developed called dynamic proximity selection, which helps predict the new patient’s condition.
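The first part of that pipeline, serializing a tabular EHR row into text and mapping it to an embedding vector, can be sketched as follows. This is a minimal, illustrative version: the real system uses a pretrained multimodal encoder, while here a toy bag-of-words vector stands in so the flow is runnable, and all field names are invented for the example.

```python
def ehr_to_text(record: dict) -> str:
    """Flatten a structured (tabular) EHR row into a sentence-like string."""
    return ", ".join(f"{key}: {value}" for key, value in record.items())

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy embedding: term counts over a fixed vocabulary. A stand-in for
    the learned encoder a system like MedPromptX would actually use."""
    tokens = text.lower().replace(",", " ").replace(":", " ").split()
    return [float(tokens.count(term)) for term in vocab]

# Hypothetical EHR row for one patient.
record = {"white blood cells": "14.2", "temperature": "38.5", "cough": "yes"}
text = ehr_to_text(record)
vocab = ["cough", "temperature", "cells", "yes"]
vector = embed(text, vocab)
```

In the actual system the same encoder also embeds the grounded X-ray regions, so image and text land in a shared vector space.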

Dynamic proximity selection identifies similar cases based on a threshold value, allowing only the most relevant examples to guide the model’s predictions. “With dynamic proximity selection we exclude ambiguous examples, and only include the examples that are similar to the patient,” Shaaban explains. This method improves the accuracy of diagnoses while reducing the model’s dependency on labeled datasets, which take a significant amount of time to create. “We are giving examples to the model to guide it to provide the best answer to the question that is being asked,” she adds.
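The threshold-based filtering described above can be sketched with a simple similarity search. The cosine-similarity scoring and the threshold value here are assumptions for illustration; the paper’s exact selection rule may differ.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_examples(query, cases, threshold=0.8):
    """Keep only historical cases whose embedding is at least
    `threshold`-similar to the query, excluding ambiguous examples."""
    scored = [(label, cosine(query, emb)) for label, emb in cases]
    return [(label, score) for label, score in scored if score >= threshold]

# Hypothetical embeddings: one close historical case, one dissimilar one.
query = [1.0, 0.0, 1.0]
history = [
    ("pneumonia", [1.0, 0.1, 0.9]),  # similar to the query -> kept
    ("edema",     [0.0, 1.0, 0.0]),  # dissimilar           -> excluded
]
selected = select_examples(query, history)
```

Only the cases that survive the threshold are passed to the model as in-context examples, which is how ambiguous neighbors are kept out of the prompt.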

Shaaban and her team also employed a technique called few-shot prompting that helps the model understand the kind of responses that are expected by showing it examples. In this case, the model was given several examples, ranging from four to 12, alongside a new query, to help it understand the desired output. Few-shot prompting can help users tailor the performance of a system to their needs without having to retrain or fine-tune it: processes that require time, resources and technical expertise.
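Assembling such a few-shot prompt can be sketched as below: a handful of labeled historical cases (four here, within the four-to-12 range the team used) are prepended to the new patient’s query so the model sees the expected question-and-answer format. The prompt template and the example findings are illustrative, not the paper’s exact wording.

```python
def build_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Concatenate labeled examples, then the unanswered query, so the
    model completes the final 'Diagnosis:' line in the same format."""
    parts = []
    for findings, diagnosis in examples:
        parts.append(f"Findings: {findings}\nDiagnosis: {diagnosis}")
    parts.append(f"Findings: {query}\nDiagnosis:")  # left for the model
    return "\n\n".join(parts)

# Hypothetical labeled cases retrieved for the prompt.
examples = [
    ("fever, productive cough, lobar opacity", "pneumonia"),
    ("bilateral interstitial markings, leg swelling", "edema"),
    ("clear lung fields, normal labs", "no finding"),
    ("pleural line, absent lung markings", "pneumothorax"),
]
prompt = build_prompt(examples, "fever, cough, right lower lobe opacity")
```

Because the guidance lives entirely in the prompt, the same underlying model can be steered to new tasks without retraining or fine-tuning.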

To process the prompts and provide a decision about patient cases, the researchers use a multimodal language model called Med-Flamingo, which had been pre-trained on medical data and can process both text and image inputs.

Results

The scientists compared the performance of MedPromptX to four other models: BioMedLM, Clinical-T5-Large, Med-Flamingo and OpenFlamingo. They did not test the other models with dynamic proximity selection or visual grounding due to differences in how the models handle data. Med-Flamingo and OpenFlamingo could process X-ray images and the text prompt, but not EHR data. BioMedLM and Clinical-T5-Large processed the text prompt and EHR data. MedPromptX was the only model given all three inputs: X-ray images processed with visual grounding, EHR text and the prompt.

With dynamic proximity selection and visual grounding enabled, MedPromptX outperformed the other models on precision (77.3%), F1 score (68.6%) and accuracy (68.9%). Clinical-T5-Large was the next-best performing model, with scores of 70.7%, 57.6% and 59.5% on the same metrics. Without visual grounding, MedPromptX achieved a recall of 58.1%, again outperforming the other models; the next-best model was, once more, Clinical-T5-Large, at 37.1%.

Out in the world

While MedPromptX represents an advancement in the development of AI systems that can help physicians diagnose patients, Shaaban recognizes that true validation of MedPromptX’s capabilities will come from real-world testing. Implementing the model in clinical settings, where it can be used with a wide variety of patient data, will be essential to fully assess its impact and reliability. Even so, systems like MedPromptX are designed to support, not replace, physicians, providing them with valuable insights that help make diagnoses more precise and efficient.
