Advancing the scientific study of causal discovery, which seeks to identify causal relationships from observational data, has significant implications for many disciplines, such as economics and epidemiology, and for our fundamental understanding of how the world works.
Over the past decades, scientists have developed methods and theorems that can be used to illuminate causal relationships from observational data, and these approaches have had practical benefits in many cases. But causal discovery is perhaps even more complex than it appears, as real causal interactions typically occur at a level of resolution that is impossible to capture in practice because of the physical scale and frequency at which they take place.
“It’s very difficult to ensure the observational frequency matches the frequency of ground truth causal interactions,” said Shunxing Fan, a research assistant at the Mohamed bin Zayed University of Artificial Intelligence and coauthor of a study that was presented at the International Conference on Machine Learning (ICML 2024).
“The causal interactions always happen at a micro, even atomic, level,” Fan said. “In many cases, our observed data is at the macro level of the underlying process.”
The study, conducted by MBZUAI researchers and faculty, argues that the results of prevailing causal discovery methods may be “distorted” by the fact that data used in causal discovery is aggregated, meaning these measurements are an average or summation of the real interactions. This is one of the most fundamental problems in causal discovery.
“Observational data is often a lossy version of information from real-world systems. For example, the data may be discretized, have missing values, or be subject to selection bias. Aggregation is also one of these situations,” Fan said. “When we perform aggregation, we lose information.”
One thing leads to another
The authors provide a common example to illustrate how aggregated data is often used. On days when the temperature is high, ice cream sales increase. By considering aggregated data on temperature and ice cream purchases, one could argue that there is a causal relationship between temperature and sales of ice cream.
The authors note, however, that in reality there is a time-lag between the two events: “a high temperature at a specific past moment influences people’s decision to purchase ice cream, which then leads to a sales transaction at a subsequent moment.” It’s only because these phenomena are observed through aggregated data — the daily high temperature and total ice cream sales for that day — that it seems like there is “instantaneous causality” between temperature and ice cream sales.
In this and many other cases, two variables may in theory have a causal relationship, but the observed data does not capture those variables themselves, only their averages, Fan explained. “For particular time points, we may be able to say that x will cause y. But we can’t really get the data for x and y, we can only get the average of x and y over a period of time.”
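The effect of aggregation described here can be illustrated with a small simulation. The sketch below is not the method from the study; it is a toy model under stated assumptions: a fine-grained variable x (standing in for temperature) that fluctuates independently at each step, a variable y (standing in for sales) that depends only on the previous step's x, and aggregation by summing over non-overlapping windows. The function names, the window length, and the effect size are all hypothetical choices made for illustration.

```python
import random


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def simulate(n_steps=20000, window=20, effect=0.8, seed=0):
    """Toy model (assumed, not from the study): y[t] depends on x[t-1] only."""
    rng = random.Random(seed)

    # Fine-grained "temperature": independent noise at every step.
    x = [rng.gauss(0, 1) for _ in range(n_steps)]

    # Fine-grained "sales": driven by the PREVIOUS step's x, plus noise.
    y = [rng.gauss(0, 1)]
    for t in range(1, n_steps):
        y.append(effect * x[t - 1] + rng.gauss(0, 1))

    # Aggregated observations: sums over non-overlapping windows,
    # analogous to daily totals of a process that evolves minute by minute.
    x_agg = [sum(x[i:i + window]) for i in range(0, n_steps, window)]
    y_agg = [sum(y[i:i + window]) for i in range(0, n_steps, window)]

    return {
        "fine_same_step": pearson(x, y),            # x[t] vs y[t]
        "fine_lagged": pearson(x[:-1], y[1:]),      # x[t-1] vs y[t]
        "aggregated_same_window": pearson(x_agg, y_agg),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

In this toy setting, x and y at the same fine-grained step are close to uncorrelated, because the influence only arrives one step later. Once both series are summed over windows, the same-window totals become clearly correlated, so a time-lagged relationship looks instantaneous in the aggregated data.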
If the goal of an analysis is to get information about correlations, then averaged data may be adequate, Fan said. However, recovering causal relationships requires more information, and that information may be distorted by aggregation. Few studies discuss the conditions under which aggregated data are sufficient to illuminate causal relationships, or why aggregation destroys the performance of causal discovery methods.
From theory to practice
Fan studied statistics as an undergraduate and turned his attention to causality under the supervision of Professor Kun Zhang because causality can help people understand the world better. Today, he’s interested in how causal discovery can be used in practical, real-world scenarios. “There have been developments in theoretical causal discovery, but in real-world scenarios there is no guarantee that outputs of the causal discovery method are correct,” Fan said.
He noted that even though many causal discovery methods have theoretical guarantees under certain assumptions, the data in the real world often violate these assumptions. “This is why we need to focus on the real-world issue of causal discovery and be more careful when we do causal discovery in aggregated data,” Fan said.