Perhaps the most important benefit of scientific inquiry is that it provides an understanding of why things happen and how interventions can be made to make them happen differently. These “why” questions are at the heart of the field of causal discovery, which uses algorithms to analyze variables and discover causal relationships between them.
This concern with causality sets causal discovery apart from traditional machine learning, which is to a large extent informed by statistical probability. Take, for example, large language models powered by machine-learning algorithms. These systems are very good at recognizing patterns in data and can quickly generate coherent and often accurate answers. That said, these systems don’t understand whether the answers they generate are right or wrong in any fundamental way.
“When you try to do prediction, there is no true model, you’re just trying to find something that is optimal or to improve performance” of the system, says Kun Zhang, associate department chair of Machine Learning (research), director of the Center for Integrative Artificial Intelligence (CIAI), and visiting professor of Machine Learning at MBZUAI.
“With causal discovery, we are trying to recover the truth with correctness guarantees, which makes it more challenging than traditional machine learning that’s focused on predictions.”
Zhang is co-author of a study that proposes a new causal discovery algorithm that fills an important gap in the causal discovery toolkit. The study, “Distributional Equivalence in Linear Non-Gaussian Latent-Variable Cyclic Causal Models: Characterization and Learning”, was shared in an oral presentation at the 14th International Conference on Learning Representations (ICLR) in Rio de Janeiro. Zhang’s co-authors are Haoyue Dai, formerly a visiting student at MBZUAI, Immanuel Albrecht, and Peter Spirtes.
Causal discovery is complicated by the fact that some important variables, known as latent variables, can’t be observed directly. For example, while a psychological dataset may include survey responses that can be used to build an understanding of a person’s psychology, the responses themselves are simply representations of underlying personality traits. Or consider a field like economics, where prices and market behavior are often shaped by forces that don’t necessarily appear in a dataset.
Accounting for latent variables is one of the central challenges of causal discovery. It’s also where existing methods run into trouble, Zhang says. Because modeling real-world phenomena is so complex, methods have typically relied on what are known as structural assumptions, constraints imposed in advance that determine what kind of causal relationships are allowed by a model.
Some methods assume that latent variables can only influence observed variables in certain ways. Others assume that there are no feedback loops, even though these often happen in the real world. Assumptions are useful and necessary in many cases, but they can also force constraints onto data that don’t match up with reality, generating misleading results.
Another reason that causal discovery methods have relied on assumptions is that no one had solved a more fundamental problem. Before you can design a method to find the right causal model, you need to know which models are distinguishable from each other by using data. But when different causal models produce the same observed data distribution, no algorithm can choose between them. This collection of causal models that produce indistinguishable data distributions is called an equivalence class, and without an understanding of the equivalence class, you don’t know what insights your method is capable of recovering.
Zhang offers a simple example. Imagine there are two variables (X and Y) that have a linear relationship and a Gaussian distribution. If the variables covary, their joint distribution looks the same whether X influences Y or Y influences X. And yet, from a causal perspective, it matters which variable influences the other. “In one case, you would say that the symptoms cause the disease and in the other the disease causes the symptoms,” he says. They can’t both be right.
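Zhang’s two-variable example can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the study: the function names and the numbers (a coefficient of 0.8 and noise variances of 1.0 and 0.5) are assumptions chosen for the demonstration. It uses the fact that a zero-mean Gaussian distribution is fully described by its variances and covariance, so if a model with the arrow reversed reproduces those three numbers, the two models generate the same distribution.

```python
import math

def cov_x_causes_y(b, var_ex, var_ey):
    """(var_x, cov_xy, var_y) implied by X = e_x, Y = b*X + e_y."""
    return var_ex, b * var_ex, b * b * var_ex + var_ey

def cov_y_causes_x(c, var_ey, var_ex):
    """(var_x, cov_xy, var_y) implied by Y = e_y, X = c*Y + e_x."""
    return c * c * var_ey + var_ex, c * var_ey, var_ey

def reverse_model(var_x, cov_xy, var_y):
    """Parameters of a Y -> X model that matches a given covariance."""
    c = cov_xy / var_y                      # regression slope of X on Y
    var_ey = var_y                          # Y is now the root variable
    var_ex = var_x - cov_xy ** 2 / var_y    # leftover variance of X
    return c, var_ey, var_ex

# Hypothetical "true" model: X causes Y.
forward = cov_x_causes_y(b=0.8, var_ex=1.0, var_ey=0.5)

# Build a model with the arrow reversed and compare what it implies.
c, var_ey, var_ex = reverse_model(*forward)
backward = cov_y_causes_x(c, var_ey, var_ex)

same_distribution = all(math.isclose(f, g) for f, g in zip(forward, backward))
print(same_distribution)
```

Both directions yield the same variances and covariance, so with Gaussian noise no amount of observational data separates them. The two models belong to the same equivalence class.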
For complex systems with many observed and latent variables, the number of equivalent causal models becomes huge. Previous methods addressed this by imposing assumptions that shrank the equivalence class to something manageable. But this meant that the methods were only valid when the assumptions held.
Zhang and his co-authors’ study takes the opposite approach. Rather than constraining the problem with assumptions, they first ask: what is the full equivalence class with no structural assumptions at all?
In their study, the researchers establish for the first time a characterization of distributional equivalence for linear, non-Gaussian models with arbitrary latent variables and feedback loops. This means that for any two causal models of this type, the researchers can determine whether they are observationally distinguishable or not. They can also analyze all models in an equivalence class and recover that class from data without imposing structural assumptions on how the latent variables must behave.
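To make the model class concrete, here is a minimal sketch of a system that is linear, has non-Gaussian noise, contains a latent variable, and includes a feedback loop. It is not the authors’ code or the glvLiNG algorithm. The function name, the coefficients a, b, p, and q, and the use of uniform noise as the non-Gaussian noise are all assumptions made for illustration. A latent variable L influences both X and Y, while X and Y also influence each other; each sample is the equilibrium of the equations X = a·L + p·Y + e_x and Y = b·L + q·X + e_y.

```python
import random

def sample_equilibrium(a, b, p, q, rng):
    """Draw one equilibrium sample from a two-variable feedback loop
    driven by a shared latent variable and uniform (non-Gaussian) noise."""
    latent = rng.uniform(-1, 1)
    e_x = rng.uniform(-1, 1)
    e_y = rng.uniform(-1, 1)
    det = 1.0 - p * q  # the loop has a unique equilibrium only if p*q != 1
    x = (a * latent + e_x + p * (b * latent + e_y)) / det
    y = (b * latent + e_y + q * (a * latent + e_x)) / det
    return latent, e_x, e_y, x, y

rng = random.Random(0)
a, b, p, q = 1.0, 0.5, 0.3, -0.4  # hypothetical coefficients
samples = [sample_equilibrium(a, b, p, q, rng) for _ in range(1000)]

# A causal discovery method only sees the observed columns;
# the latent variable and the noise terms are discarded.
observed = [(x, y) for (_, _, _, x, y) in samples]

# Sanity check: every sample satisfies both structural equations.
max_residual = max(
    max(abs(x - (a * l + p * y + ex)), abs(y - (b * l + q * x + ey)))
    for (l, ex, ey, x, y) in samples
)
print(len(observed), max_residual < 1e-9)
```

The question the study addresses is which graphs of this kind could have produced the same distribution over the observed columns alone.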
A tool called “edge ranks” is central to their approach. Previous work in this area used a concept called “path ranks,” which measure how information moves through a causal graph. Path ranks are useful, but they can be difficult to work with, as changes to them can have global effects on the graph. As the researchers explain in their study, edge ranks act more locally, are “easier to manipulate,” and complement path ranks.
Based on this framework, the researchers developed an algorithm called glvLiNG (general latent variable linear non-Gaussian causal discovery). It can “traverse” an entire equivalence class efficiently, meaning that it can systematically move through equivalent models, identifying which causal features are certain and which ones are ambiguous based on the data.
Zhang says that the team’s approach is a “breakthrough in the sense” that, under mild assumptions, it can enumerate all possible solutions in an equivalence class, and it provides methods for traversing those solutions and recovering the equivalence class from data.
The researchers conducted several experiments with glvLiNG, including one on a dataset of the daily stock prices of 14 companies listed on the Hong Kong Stock Exchange. The algorithm identified an equivalence class of more than 19,000 causal graphs with two latent variables. It also identified meaningful causal patterns, with large banks acting as upstream causes and real estate companies appearing to be downstream receivers of effects. One of the latent variables was interpretable, describing a shared ownership structure between companies.
To achieve this result, the researchers had to solve a difficult graph theory problem that was outside of their area of expertise. This led the lead author of the study, Haoyue Dai, to reach out to mathematicians who specialize in graph theory. One of those who responded was Immanuel Albrecht, a professor at FernUniversität in Hagen, Germany. His expertise was a critical contribution to the concept of edge ranks.
Zhang says that at a moment when much of the field’s attention is on scaling and applications, this study is a reminder that fundamental open problems still exist and addressing them can lead to important insights.
“It’s an example of how innovations in causal discovery can really shape the next generation of models and how people from other fields, like mathematics, can help us build better AI systems and produce something new,” he says.