New machine-learning approach to inform cancer prognoses presented at MICCAI

Thursday, October 24, 2024

Despite advances in treatment, cancers remain some of the world’s most deadly diseases, leading to more than 10 million deaths across the globe each year, according to the World Health Organization. Unfortunately, that number is set to grow. By 2040, the WHO estimates that 30 million people will die yearly from cancers.

When a patient is diagnosed with cancer, physicians turn to statistical tools called survival models to aid them in developing a prognosis. Predictions produced by survival models influence how physicians decide to treat the disease, and accurate predictions can significantly improve patient outcomes.

Over the past several years, scientists have developed machine-learning techniques to improve the performance of survival models, but making accurate predictions remains a significant challenge due to the complexity of the data that these predictions are based on.

Machine-learning models must be designed to process multimodal data, such as electronic health records, images of tumors, and lab reports, and make comprehensive sense of these different types of information. Another obstacle is that patient records are often incomplete. When the missing piece is the outcome itself, such as how long a patient survived, the record is known as censored data, and it makes training machine-learning models even more difficult.

A team of scientists at the Mohamed bin Zayed University of Artificial Intelligence, led by Numan Saeed, a postdoctoral fellow in computer vision, recently developed a new method that can be integrated into survival models. Their approach makes use of multimodal data while also accounting for censored data. Saeed and his colleagues call their innovation survival rank-n-contrast (SurvRNC). It is designed to predict survival times for head and neck cancer patients based on patient health records.

Saeed recently presented the team’s work at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), in Marrakesh, Morocco.

Making sense of missing information

In a perfect world, patient records would be complete and contain all relevant information. The reality is different. In some cases, a record will paint a full picture of a patient’s medical history, with lab results, dates of diagnosis and treatment, and, perhaps, the point at which they died. In other cases, a record may include only the date on which a patient was diagnosed and the date of their last visit to the doctor. “We may know that 100 days after diagnosis the patient was alive, but what happened to them after that is not captured in the dataset,” Saeed says.

The data contained in incomplete records is still valuable, even if it isn’t comprehensive, Saeed explains, but it must be processed differently than data from a complete record. “We need to tell the model that for censored patients, we don’t know when they died, but we knew they were alive at a certain point,” he says.

According to this approach, patients who have complete records, referred to as certain patients, receive more weight in the SurvRNC model, while patients who have missing data, referred to as uncertain patients, receive less weight. (Weight refers to the importance the model places on the data.) “In any prognosis problem, the censored patients are always a challenge, and there are different ways to give them more or less weight,” Saeed says.
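The distinction between certain and uncertain patients can be illustrated with a minimal Python sketch. The class name, field names, and the weight of 0.5 are hypothetical choices made for illustration; the article does not state how SurvRNC actually sets its weights.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    patient_id: str
    days_observed: float   # days from diagnosis to death, or to the last visit
    death_observed: bool   # False means the record is censored


def sample_weight(record: PatientRecord, censored_weight: float = 0.5) -> float:
    """Return 1.0 for a certain patient and a smaller weight for an uncertain one.

    The value 0.5 is an illustrative assumption, not a figure from the article.
    """
    return 1.0 if record.death_observed else censored_weight


records = [
    PatientRecord("A", 420.0, True),    # death recorded 420 days after diagnosis
    PatientRecord("B", 100.0, False),   # alive at day 100, nothing known afterward
]
weights = {r.patient_id: sample_weight(r) for r in records}
print(weights)
```

Patient B still contributes to training, since the record shows survival of at least 100 days, but it counts for less than patient A, whose survival time is fully known.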

Complexities of predicting survival

Saeed and his colleagues worked with a large dataset called HECKTOR that includes information on nearly 500 patients with head and neck cancer, one of the deadliest and most difficult cancers to treat. The dataset includes medical images, such as CT (computed tomography) and PET (positron emission tomography) scans, electronic health records, and information related to time of diagnosis and time of treatment. Approximately 70% of the records in the dataset don’t include conclusive information about how long the patient survived following diagnosis. Despite the missing data, which is a common problem in medical datasets, Saeed says that HECKTOR is “very good for this kind of work because of the diversity of hospitals at which the data was collected and the variety of machines that were used.”

Machine-learning models have long been used for what are known as classification tasks. In the medical field, this could be categorizing patients into groups based on their risk of developing a disease. Predicting survival time, however, is a different kind of problem, called a regression task. Regression tasks produce what are known as ordered representations, which are values along a continuum. In the case of SurvRNC, the ordered representation is a value of time. Regression tasks are generally harder to solve than classification tasks because they require a more nuanced interpretation of data. Subtle differences between records — for example, a small difference in the location or size of a tumor — can have a major impact on a patient’s outcome and must be considered by the model to make an accurate prediction.

The approach developed by Saeed and his colleagues processes multimodal patient records, creating ordered representations of patients as embedding vectors in an embedding space. Some patient records include survival time and others don’t, and SurvRNC handles the two kinds differently. Pairs of patients whose records include survival times, and whose survival times are similar, are treated as “positive pairs” and are moved closer together in the embedding space. Pairs of patients with a larger difference in survival time, called “negative pairs,” are pushed apart. Patients who are missing survival time are paired with similar patients in the embedding space, but these “uncertain pairs” are given less weight by the model.
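The pull-together and push-apart behavior can be sketched with a classic margin-based contrastive loss. This is a stand-in chosen for illustration, not the loss used by SurvRNC, which the article does not specify. The function names, the 90-day similarity threshold, the margin, and the uncertain-pair weight are all hypothetical, and a censored patient's last known follow-up time is used in place of a survival time as a simplification.

```python
import math


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_loss(emb_i, emb_j, time_i, time_j, both_certain,
              similar_days=90.0, margin=1.0, uncertain_weight=0.5):
    """Loss for one pair of patients (all default values are assumptions).

    Pairs with similar times are penalized for being far apart; pairs with
    dissimilar times are penalized for being closer than `margin`. Pairs
    that involve a censored patient are down-weighted.
    """
    d = distance(emb_i, emb_j)
    if abs(time_i - time_j) <= similar_days:
        loss = d ** 2                      # positive pair: pull together
    else:
        loss = max(0.0, margin - d) ** 2   # negative pair: push apart
    weight = 1.0 if both_certain else uncertain_weight
    return weight * loss


a, b = [0.0, 0.0], [0.25, 0.0]
print(pair_loss(a, b, 400, 430, both_certain=True))    # similar survival times
print(pair_loss(a, b, 400, 900, both_certain=True))    # very different times
print(pair_loss(a, b, 400, 430, both_certain=False))   # one patient is censored
```

Minimizing the sum of such terms over many pairs moves patients with similar outcomes toward each other and separates the rest, while the lower weight keeps uncertain pairs from dominating training.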

The researchers compared the performance of SurvRNC to other state-of-the-art machine-learning survival models on the HECKTOR dataset. SurvRNC displayed an improvement of 3.6% over the next best-performing system.
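The article does not name the metric behind this comparison. One metric commonly used to compare survival models on censored data is the concordance index, which measures how often a model ranks pairs of patients in the correct order. The sketch below is a plain-Python illustration with made-up values, not the evaluation code used by the researchers.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs that the model ranks in the right order.

    A pair is comparable when the patient with the shorter time has an
    observed death. The pair is concordant when that patient also has the
    higher predicted risk; ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Illustrative values only: the second patient is censored at day 200.
times = [100, 200, 300, 400]
events = [True, False, True, True]
risk_scores = [0.9, 0.2, 0.1, 0.6]
c_index = concordance_index(times, events, risk_scores)
print(c_index)
```

A value of 1.0 means every comparable pair is ordered correctly, and 0.5 is what random scores would achieve on average. The censored patient is never used as the shorter-lived member of a pair, because the time of death is unknown.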

Next steps

Saeed intends to continue to improve SurvRNC and hopes to implement the system in a trial at a medical center. The system could also be used to predict survival in other cancers, such as colon and breast cancer, he says.

“New AI tools can be extremely helpful, especially in remote areas around the world and in countries that lack expensive equipment that is used to treat cancer in more developed countries,” Saeed says. “With this work, we hope to be able to reduce the burden on doctors so that they can more quickly determine what the best treatments are for their patients.”
