Researchers in a wide variety of fields try to understand how variables in data relate to each other. In finance, analysts might investigate whether the price of one asset influences or is independent of the price of another. In public health, researchers may be interested in determining how certain environmental or behavioral factors contribute to disease in specific populations.
It is, however, extremely difficult to determine causal relationships between variables, and scientists are continually developing new statistical and machine learning techniques to improve this process. A key concept in this effort is conditional independence, which helps researchers figure out whether one variable truly influences another.
Scientists have developed methods for determining conditional independence, but these approaches don’t work well in all situations.
Datasets often contain a mix of what are known as continuous variables, which can take any value within a given range, and discrete variables, which fall into distinct categories. Discrete variables can be inherently discrete — for example, whether someone smokes or doesn’t — or they can be simplifications of underlying continuous variables.
Current conditional independence tests assume discrete variables are inherently discrete, explains Boyang Sun, a Ph.D. student in Machine Learning at MBZUAI. “But for variables like cancer severity, which are often recorded as ‘phase one,’ ‘phase two,’ etc., the underlying variable is actually continuous and using current methods in these cases can lead to misleading conclusions.”
To address this, Sun and colleagues at MBZUAI and other institutions developed a new conditional independence test that addresses an important yet overlooked scenario: cases where variables are inherently continuous but are represented in discretized form due to limitations in data collection. “This research addresses a fundamental problem in machine learning and statistics,” Sun says.
The team’s findings will be presented at the 13th International Conference on Learning Representations (ICLR), which will be held in Singapore later this month. Guangyuan Hao and Kun Zhang of MBZUAI contributed to the study.
Sun and his coauthors call their method DCT (discretization-aware conditional independence test), and it can determine the dependence of two variables when both are discrete, when both are continuous, or when one is discrete and the other is continuous. “There are many different conditional independence tests that work on continuous data,” Sun says. “But there are no tests that correctly infer relationships between latent continuous variables when we only observe discretized versions of those variables.”
Determining if variables influence one another is important because in large datasets, it’s necessary to focus on the variables that matter. “If your graph is very dense, and you try to consider every variable, it won’t be tractable” due to the huge number of computations that are necessary, Sun explains. “The way to do this is to find the variables that are conditionally dependent, which makes the calculation simpler.”
The researchers tested their method on both synthetic and real-world datasets. Their first step was to attempt to recover the underlying, or latent, continuous variables from their discrete representations. The researchers did this by using statistical techniques to estimate the covariance of the two kinds of variables, essentially seeking to determine how accurately the discrete variables represented the latent continuous variables.
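The arcsine relation for binarized bivariate Gaussians gives a concrete sense of how such a recovery can work. The sketch below is not the authors’ code; it is a minimal illustration, assuming standard Gaussian latent variables binarized at a zero threshold, of recovering a latent correlation from purely discrete observations:

```python
import math
import random

def recover_latent_correlation(pairs):
    """Recover the latent Gaussian correlation from binarized observations.

    For standard bivariate Gaussians binarized at zero,
    P(X > 0 and Y > 0) = 1/4 + arcsin(rho) / (2*pi),
    so rho = sin(2*pi*(p - 1/4)), where p is that joint probability.
    """
    p_both = sum(1 for a, b in pairs if a == 1 and b == 1) / len(pairs)
    return math.sin(2 * math.pi * (p_both - 0.25))

random.seed(0)
true_rho = 0.6
samples = []
for _ in range(200_000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x = z1                                               # latent continuous X
    y = true_rho * z1 + math.sqrt(1 - true_rho**2) * z2  # latent continuous Y
    # We only get to observe the discretized versions (e.g. "low"/"high").
    samples.append((int(x > 0), int(y > 0)))

rho_hat = recover_latent_correlation(samples)
print(round(rho_hat, 2))  # close to the true value of 0.6
```

In real data the discretization thresholds are unknown and must be estimated, and the paper handles the general mixed continuous/discrete case; this toy example only shows why discretization does not destroy the correlation information carried by the latent variables.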
They then used a type of regression analysis, called node-wise regression, that tests for conditional independence based on the estimated covariance.
The team’s approach is built on the assumption that the latent continuous variables follow a normal, or Gaussian, distribution, which is a common and useful assumption, Sun says. The Gaussian assumption allowed the researchers to calculate how the test statistic should behave if the variables were truly independent.
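Under a Gaussian assumption, testing conditional independence reduces to testing whether a partial correlation is zero, and the Fisher z-transform of the sample partial correlation is approximately standard normal under the null. A minimal sketch of that standard idea (the three-variable setup, with X and Y both driven by Z, is invented for illustration and is not from the paper):

```python
import math
import random

def corr(u, v):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv)

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y given Z, from pairwise correlations."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

random.seed(1)
n = 50_000
Z = [random.gauss(0, 1) for _ in range(n)]
X = [z + random.gauss(0, 1) for z in Z]  # X depends on Z
Y = [z + random.gauss(0, 1) for z in Z]  # Y depends on Z, not directly on X

r_xy = corr(X, Y)  # clearly nonzero: X and Y are marginally dependent
r = partial_corr(r_xy, corr(X, Z), corr(Y, Z))  # near zero: X and Y are
                                                # independent given Z
# Fisher z-transform: approximately N(0, 1) when the null truly holds
# (n - 3 - |conditioning set|; here the conditioning set is just {Z}).
z_stat = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3 - 1)
print(round(r_xy, 2), round(r, 3))  # compare |z_stat| to 1.96 for a 5% level
```

DCT follows the same broad logic but works from a covariance estimated through the discretized observations, which is what lets it reach conclusions about the latent continuous variables rather than their coarsened versions.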
Sun and his coauthors’ method outperformed other existing tests in detecting relationships between variables. It has the potential to be used in the field of causal discovery, which relies heavily on determining the conditional independence of variables. “If you are interested in causal discovery, you need to use a conditional independence test to sequentially test to remove edges, or connections, between variables to determine their real relationships,” Sun says.
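The edge-removal loop Sun describes is the skeleton phase of constraint-based causal discovery, as in the PC algorithm. A toy sketch, assuming Gaussian data and using a simple correlation cutoff as a stand-in for a formal conditional independence test (the chain structure X → Z → Y and the cutoff value are invented for illustration):

```python
import math
import random

def corr(u, v):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv)

def partial_corr(r_ab, r_ac, r_bc):
    """Partial correlation of A and B given C, from pairwise correlations."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

random.seed(2)
n = 50_000
X = [random.gauss(0, 1) for _ in range(n)]
Z = [x + random.gauss(0, 1) for x in X]   # X -> Z
Y = [z + random.gauss(0, 1) for z in Z]   # Z -> Y: X and Y linked only via Z
data = {"X": X, "Y": Y, "Z": Z}

THRESH = 0.05  # illustrative cutoff standing in for a calibrated test
edges = {frozenset(p) for p in [("X", "Y"), ("X", "Z"), ("Y", "Z")]}
for a, b in [("X", "Y"), ("X", "Z"), ("Y", "Z")]:
    r_ab = corr(data[a], data[b])
    if abs(r_ab) < THRESH:               # marginally independent: drop edge
        edges.discard(frozenset((a, b)))
        continue
    (c,) = {"X", "Y", "Z"} - {a, b}      # condition on the remaining variable
    r_cond = partial_corr(r_ab, corr(data[a], data[c]), corr(data[b], data[c]))
    if abs(r_cond) < THRESH:             # conditionally independent: drop edge
        edges.discard(frozenset((a, b)))

print(sorted(tuple(sorted(e)) for e in edges))  # the X-Y edge is removed
```

Real constraint-based algorithms replace the fixed cutoff with a calibrated hypothesis test, such as the paper’s DCT when the observed variables are discretized versions of continuous ones, and search over conditioning sets of growing size.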
Beyond the technical contribution, the research also shaped Sun’s perspective on the role of mathematics and deepened his understanding of its influence on the fields of machine learning and statistics. “It made me realize that mathematics is fundamental and extremely important, and yet there is so much left to solve.”