The complexities of identifying causality in the real world: A new study presented at ICML

Saturday, July 27, 2024

The implications of advancing the scientific study of causal discovery, which seeks to identify causal relationships from observational data, are significant for many disciplines, like economics and epidemiology, and for our fundamental understanding of how things work in the world.

Over the past decades, scientists have developed methods and theorems that can be used to illuminate causal relationships from observational data, and these approaches have had practical benefits in many cases. But causal discovery is perhaps even more complex than it appears, as real causal interactions typically occur at a resolution that is, in practice, impossible to capture because of the physical scale and frequency of those interactions.

“It’s very difficult to ensure the observational frequency matches the frequency of ground truth causal interactions,” said Shunxing Fan, a research assistant at the Mohamed bin Zayed University of Artificial Intelligence and coauthor of a study that was presented at the International Conference on Machine Learning (ICML 2024).

“The causal interactions always happen at a micro, even atomic, level,” Fan said. “In many cases, our observed data is at the macro level of the underlying process.”

The study, conducted by MBZUAI researchers and faculty, argues that the results of prevailing causal discovery methods may be “distorted” by the fact that data used in causal discovery is aggregated, meaning these measurements are an average or summation of the real interactions. This is one of the most fundamental problems in causal discovery.

“Observational data is often a lossy version of information from real-world systems. For example, the data may be discretized, have missing values, or be subject to selection bias. Aggregation is also one of these situations,” Fan said. “When we perform aggregation, we lose information.”

One thing leads to another

The authors provide a common example to illustrate how aggregated data is often used. On days when the temperature is high, ice cream sales increase. By considering aggregated data on temperature and ice cream purchases, one could argue that there is a causal relationship between temperature and sales of ice cream.

The authors note, however, that in reality there is a time-lag between the two events: “a high temperature at a specific past moment influences people’s decision to purchase ice cream, which then leads to a sales transaction at a subsequent moment.” It’s only because these phenomena are observed through aggregated data — the daily high temperature and total ice cream sales for that day — that it seems like there is “instantaneous causality” between temperature and ice cream sales.

In this and many other cases, one variable may well cause another, but the observed data are not the variables themselves, only their averages, Fan explained. “For particular time points, we may be able to say that x will cause y. But we can’t really get the data for x and y, we can only get the average of x and y over a period of time.”
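The effect Fan describes can be reproduced with a small simulation. The sketch below is only an illustration, not the method from the study: it assumes a hypothetical fine-grained process in which x at one step influences y at the next step, with made-up values for the coefficient, noise level and window size, and then averages both series over fixed windows in the way a daily summary would.

```python
import numpy as np


def simulate(n_steps=200_000, window=50, coef=0.8, seed=0):
    """Simulate a lagged cause (x -> next-step y) and its windowed averages.

    All parameter values are illustrative assumptions, not taken from the study.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_steps)            # fine-grained "temperature"
    noise = 0.5 * rng.standard_normal(n_steps)  # unrelated variation in y
    y = np.empty(n_steps)                       # fine-grained "sales"
    y[0] = noise[0]
    y[1:] = coef * x[:-1] + noise[1:]           # y depends on the previous x only

    # Aggregate: average each series over non-overlapping windows.
    n_win = n_steps // window
    x_agg = x[: n_win * window].reshape(n_win, window).mean(axis=1)
    y_agg = y[: n_win * window].reshape(n_win, window).mean(axis=1)
    return x, y, x_agg, y_agg


def corr(a, b):
    """Pearson correlation between two equal-length series."""
    return float(np.corrcoef(a, b)[0, 1])


def summarize(x, y):
    """Correlation at the same index versus with x one index earlier."""
    return {"same_step": corr(x, y), "lagged": corr(x[:-1], y[1:])}


if __name__ == "__main__":
    x, y, x_agg, y_agg = simulate()
    print("fine-grained:", summarize(x, y))
    print("aggregated:  ", summarize(x_agg, y_agg))
```

Under these assumptions, the fine-grained series show a strong dependence only at a lag of one step and almost none at the same step. After averaging, the pattern flips: the dependence shows up within the same window and nearly vanishes across windows, which is the appearance of “instantaneous causality” that the authors describe.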

If the goal of an analysis is to get information about correlations, then average data may be adequate, Fan said. However, recovering causal relationships requires more information, which may be distorted after aggregation. Few studies have examined the conditions under which aggregated data are sufficient to reveal causal relationships, or why aggregation undermines the performance of causal discovery methods.

From theory to practice

Fan studied statistics as an undergraduate and turned his attention to causality under the supervision of Professor Kun Zhang because causality can help people understand the world better. Today, he’s interested in how causal discovery can be used in practical, real-world scenarios. “There have been developments in theoretical causal discovery, but in real-world scenarios there is no guarantee that outputs of the causal discovery method are correct,” Fan said.

He noted that even though many causal discovery methods have theoretical guarantees under certain assumptions, the data in the real world often violate these assumptions. “This is why we need to focus on the real-world issue of causal discovery and be more careful when we do causal discovery in aggregated data,” Fan said.
