Bridging biology and AI with domain knowledge

Wednesday, March 18, 2026

When Haiyan Huang, Visiting Professor of Statistics and Data Science at MBZUAI, completed her doctoral program at the University of Southern California in the early 2000s, biology was entering a transformative era – driven by technologies capable of generating unprecedented amounts of data.

This turning point helped shape the trajectory of her research – a direction that was further sharpened during her postdoctoral fellowship at Harvard University. At this time, landmark efforts such as the Human Genome Project and the rapid rise of DNA microarray and other omics technologies were bringing experimental biologists and quantitative scientists together to tackle the challenges of large-scale genomic data.

“There were so many interesting problems out there for statisticians to solve,” Huang says as she reflects on this formative period that shaped her thinking about data and biology and encouraged her to be continually open to embracing new analytical methods.

But the challenges of data analysis are greater now than they were back then.

Early in her career, a biologist would typically approach her with what would today be considered a small dataset, hoping to draw on Huang’s expertise in statistical methods to answer a specific scientific question. In these cases, traditional statistical methods often worked well.

That has changed.

Data produced in today’s labs are often high-dimensional, heterogeneous, and generated from different sources, such as sequencing experiments, imaging studies, and cell-based assays. This represents a shift from hypothesis-driven analysis to a more data-driven approach.

“Data science for large-scale scientific applications is fundamentally about extracting meaningful signals from complex, noisy data, and translating them into scientific insight,” Huang says. Doing so requires methods that are “flexible enough to capture intrinsic patterns in the data, but also principled enough to be able to generate something scientifically meaningful.”

When data doesn’t speak for itself

In recent years, Huang has complemented her foundation in mathematics and statistics with machine- and deep-learning methods, which can surface insights from large datasets that would otherwise be missed. But on their own, these approaches do not distinguish between patterns that are statistically detectable and those that are scientifically meaningful.

Huang has therefore explored incorporating domain knowledge into modeling frameworks to address important scientific questions.

“I’m focusing on integrating statistical principles, modern AI tools, and domain expertise to develop methods that are both scientifically informed and statistically grounded,” she says.

While the intention sounds straightforward enough, it’s difficult to execute, and requires close collaboration with scientists who have domain expertise.

Two projects illustrate what this looks like in practice, and both were made possible by interdisciplinary collaborations.

In one project, Huang and her collaborators were analyzing a large gene expression dataset, looking for genes that might be acting together in a pathway in response to a biological condition. But when they applied a widely used clustering method to the full dataset, they didn’t find anything. The signal was buried in noise.

So they turned to their biologist collaborators, who provided a subset of about 80 genes that they believed to be the most likely suspects. When Huang and her collaborators repeated the same analysis on this suspect list, the results were strikingly different. Twenty-one of the 80 genes stood out, showing a very clear co-expression pattern. “The signals were there the whole time, but were obscured by too much noise,” she says.
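The effect described above can be illustrated with a small simulation. This is only a sketch on synthetic data, not the team's dataset or method: it swaps the clustering step for a simple correlation threshold, and the sizes of the full dataset (2,000 genes, 30 samples), the signal strength, and the 0.6 cutoff are all assumptions chosen for illustration. Only the 80-gene candidate list and the 21 co-expressed genes mirror the figures in the article.

```python
import numpy as np


def coexpression_precision(expr, is_module, threshold=0.6):
    """Among gene pairs with |Pearson r| > threshold, return the fraction
    in which both genes belong to the true co-expressed module."""
    corr = np.corrcoef(expr)                      # genes x genes correlation matrix
    rows, cols = np.triu_indices_from(corr, k=1)  # each unordered pair once
    strong = np.abs(corr[rows, cols]) > threshold
    both_in_module = is_module[rows] & is_module[cols]
    n_strong = strong.sum()
    return float((strong & both_in_module).sum() / n_strong) if n_strong else 0.0


# Hypothetical sizes: only n_candidates and n_module come from the article.
rng = np.random.default_rng(0)
n_genes, n_samples, n_candidates, n_module = 2000, 30, 80, 21

# Every gene starts as pure noise; the first 21 also share one latent pattern.
expr = rng.normal(size=(n_genes, n_samples))
shared_pattern = rng.normal(size=n_samples)
expr[:n_module] += 2.0 * shared_pattern

is_module = np.zeros(n_genes, dtype=bool)
is_module[:n_module] = True

# Same analysis, two inputs: all genes vs. the curated 80-gene candidate list
# (here the first 80 rows, which contain the 21 module genes).
precision_full = coexpression_precision(expr, is_module)
precision_subset = coexpression_precision(expr[:n_candidates], is_module[:n_candidates])

print(f"full dataset:   {precision_full:.2f}")
print(f"candidate list: {precision_subset:.2f}")
```

With thousands of genes and only a few dozen samples, chance correlations among unrelated genes far outnumber the true co-expressed pairs, so strong correlations in the full dataset are mostly spurious. Restricting the search to a curated list removes most of those chance pairs while keeping the real ones.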

This insight motivated them to develop an approach that combines modern and classical statistical ideas with domain knowledge to improve signal-to-noise ratios in the search for meaningful patterns, work that ultimately led to a study published in the Proceedings of the National Academy of Sciences.

The second project bridged chemistry, materials science, and data science. Huang collaborated with prominent polymer scientists to design synthetic polymers known as random heteropolymers, or RHPs, that can mimic certain behaviors of natural proteins in biological fluids. Unlike proteins, which are built from sequences of amino acids, RHPs are constructed from only a handful of chemical building blocks. From a data science perspective, Huang’s question was whether RHPs and target proteins could be embedded in a shared latent space whose structure reflects meaningful chemical and functional properties, thereby helping identify RHP designs that could potentially lead to protein-like behavior.

Huang and her collaborators again encountered a roadblock early on. They applied a variational autoencoder (VAE) to generate representations of the proteins and RHPs based on their sequences, but the model didn’t reveal any meaningful relationships between RHPs and target proteins.

After extensive discussions with their polymer scientist collaborators, they ultimately developed a hybrid VAE model that combined a classical VAE with an additional feature-based VAE. This modification encouraged the latent space to better separate meaningful chemical and biological patterns from unrelated variation, resulting in a clearer and more interpretable representation of the relationships between protein and RHP sequences.

Doing so “changed the story and generated mappings that suggested what kinds of designs would produce RHPs more likely to have protein-like function,” Huang says.

The team’s predictions were consistent with experimental results, and their work was published in the journal Nature. The hybrid VAE framework and its technical details were reported at the AAAI Workshop on AI to Accelerate Science and Engineering.

From prediction to explanation

Another area of interest for Huang is developing deep learning methods that are interpretable. Deep learning is often used as a “black box,” meaning that even if models provide relevant insights, they rarely explain how or why they arrived at their results. For a field like precision medicine, where the ultimate goal is to guide clinical decisions and improve patient care, this lack of transparency can be a significant limitation.

“When it comes to biomedical data, it’s possible to build a model that identifies hidden patterns and makes a prediction,” she says. “But from a physician’s point of view, they want to understand the underlying drivers of a disease, because that can help them better understand the disease and design a more effective strategy to treat it.”

Statistical AI for the real world

As Huang thinks about the future, she believes that deep learning and other AI-related fields will continue to make important contributions to biology and other scientific disciplines. For her own work, she wants to promote research in scientific and statistical AI that can make a tangible impact on the world. “I want to bridge statistical rigor, scientific applications, and interpretable AI,” she says.

Interdisciplinarity is an important element in this endeavor, and she encourages students not only to build a strong understanding of statistics, programming, and deep learning, but also to focus on a particular scientific discipline so that they can communicate with experts in that field.

And while the technologies and techniques that are used to produce and analyze data may change in the future, Huang says that she will continue to be interested in statistics and data science for the same reason that she was drawn to the field in the first place – it provides a method for telling scientifically informed and meaningful stories about data.
