Next Generation Causal Analysis: Inspired by the Induction of Biological Pathways from Cytometry Data

Final Report Summary - CAUSALPATH (Next Generation Causal Analysis: Inspired by the Induction of Biological Pathways from Cytometry Data)

The CAUSALPATH ERC Consolidator grant was completed in 2020. The project made significant progress in several directions and opened new research paths.

- CAUSALPATH developed novel algorithms for learning causal models in an integrative fashion from heterogeneous studies, e.g. when the studies measure different sets of quantities (variables), so that some values are missing by design, when they are subject to different selection criteria (selection bias), and when latent confounding factors are present. To do so, it introduced a methodology that converts the causal discovery problem into a mathematical logic problem. The technique is quite general and can solve complex causal discovery problems.

- CAUSALPATH applied causal discovery to single-cell mass cytometry data to induce novel causal relations and biological pathways related to the immune system. The effort identified several challenges with mass cytometry data, such as the detrimental effect of randomization. The fact that feedback is omnipresent in biological systems points to fundamental limitations of causal discovery methods based on tests of conditional independence.

- CAUSALPATH invented algorithms that learn causal models expressed as differential equations (mathematical models). Such models overcome the limitations mentioned above. It adapted to this setting a mathematical technique called “weak formulation” that converts the problem of learning a system of differential equations into ordinary regression. The technique allows algorithms for a well-studied problem to be applied to a new direction of open problems, that of learning causal dynamical systems.

- CAUSALPATH invented algorithms for standard machine learning feature selection problems. Feature selection is a first step in causal discovery. CAUSALPATH’s new algorithms scaled feature selection up to datasets with millions of candidate predictive features. In addition, the new algorithms can identify multiple feature subsets that are equally predictive. Identifying all equivalent solutions to the feature selection problem is paramount when feature selection is used for knowledge discovery or causal discovery.

- In predictive analytics, one typically tries numerous algorithms and hyper-parameter values to fit the data. However, the performance estimate of the winning model is optimistically biased, because the same data are used both to pick the winner and to estimate its performance. This is conceptually equivalent to the multiple hypothesis testing problem in statistics. As part of CAUSALPATH, a new statistical methodology was developed that removes this bias. The methodology could become the standard in predictive performance estimation for Automated Machine Learning (AutoML) tools that try thousands of algorithms and hyper-parameter value combinations.

- CAUSALPATH developed algorithms for selecting among a set of causal discovery algorithms and their hyper-parameter values to identify the best for a given dataset. The methodology could lead to Automated Causal Discovery tools.

- To allow integrative analysis of a large portion of biological data, CAUSALPATH created a large repository of uniformly preprocessed and automatically annotated omics data, called BioDataome. In addition, it introduced a method for comparing the empirical distributions of high-dimensional, low-sample-size datasets. The method was applied to BioDataome to create networks of datasets and corresponding studies that are statistically similar. The similarities point to similarities in the underlying biology between phenotypically different diseases.

- CAUSALPATH created public resources such as the MXM R package (for feature selection and causal discovery, https://cran.r-project.org/web/packages/MXM/index.html), SCENERY (a platform for single-cell causal analysis, http://scenery.csd.uoc.gr), BioDataome (a large data repository, dataome.mensxmachina.org), and Datascope (a platform for identifying similarities in the statistical patterns between different studies, http://datascope.csd.uoc.gr:25000/).
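
As an illustration of the “weak formulation” idea mentioned above, the following Python sketch shows how learning a differential equation can reduce to ordinary least-squares regression. For a model dx/dt = Σ θ_k g_k(x), multiplying by a smooth test function φ that vanishes at both ends of a time window and integrating by parts gives −∫ φ′ x dt = Σ θ_k ∫ φ g_k(x) dt, which is linear in θ and needs no numerical derivative of the noisy data. The logistic-growth example, the polynomial test functions, the window layout, the noise level, and all function names are assumptions made for this sketch; they are not taken from the project's software.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal rule for samples y on the grid t."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def poly_bump(t, a, b):
    """Test function phi = (t-a)^2 (b-t)^2 on [a, b], zero elsewhere,
    together with its derivative. Both vanish at a and b."""
    inside = (t >= a) & (t <= b)
    phi = np.where(inside, (t - a) ** 2 * (b - t) ** 2, 0.0)
    dphi = np.where(inside, 2.0 * (t - a) * (b - t) * (a + b - 2.0 * t), 0.0)
    return phi, dphi

def weak_form_regression(t, x, library, windows):
    """Estimate theta in dx/dt = sum_k theta_k * g_k(x) by least squares.
    Each window contributes one regression row:
      response  = -integral(phi' * x)
      feature_k =  integral(phi * g_k(x))"""
    rows, rhs = [], []
    for a, b in windows:
        phi, dphi = poly_bump(t, a, b)
        rows.append([trapezoid(phi * g(x), t) for g in library])
        rhs.append(-trapezoid(dphi * x, t))
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return theta

# Synthetic data (assumed example): logistic growth dx/dt = r*x - (r/K)*x^2,
# so the true coefficients are theta = (1.0, -0.5) for r = 1, K = 2.
r, K, x0 = 1.0, 2.0, 0.1
t = np.linspace(0.0, 10.0, 2001)
x_clean = K / (1.0 + (K - x0) / x0 * np.exp(-r * t))
rng = np.random.default_rng(0)
x_noisy = x_clean + rng.normal(scale=0.01, size=t.size)

library = [lambda x: x, lambda x: x ** 2]                   # candidate terms g_k
windows = [(s, s + 2.0) for s in np.arange(0.0, 8.5, 0.5)]  # overlapping windows
theta = weak_form_regression(t, x_noisy, library, windows)
print("estimated coefficients:", theta)
```

Because the derivative is moved onto the known test function, measurement noise enters only through integrals, which average it out; this is the main practical reason to prefer the weak form over regressing on finite-difference derivatives.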
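
The optimistic bias of the winning model's performance estimate, described above, can be demonstrated with a small simulation. The report does not specify how the project's methodology works, so the correction below is only an assumed illustration of one bootstrap-based approach: the per-sample out-of-sample losses of all configurations are pooled into a matrix, the best configuration is selected on each bootstrap resample, and it is scored on the samples left out of that resample. The synthetic data, function names, and parameter values are all hypothetical.

```python
import numpy as np

def select_best(losses):
    """Index of the configuration (column) with the lowest mean loss."""
    return int(np.argmin(losses.mean(axis=0)))

def bootstrap_corrected_loss(losses, n_boot=300, seed=0):
    """Assumed bootstrap correction for the winner's loss estimate.
    losses[i, c] is the out-of-sample loss of configuration c on sample i.
    Selection uses the in-bag rows; scoring uses the out-of-bag rows."""
    rng = np.random.default_rng(seed)
    n = losses.shape[0]
    estimates = []
    for _ in range(n_boot):
        in_bag = rng.integers(0, n, size=n)   # resample rows with replacement
        out_of_bag = np.ones(n, dtype=bool)
        out_of_bag[in_bag] = False
        if not out_of_bag.any():              # practically never happens
            continue
        best = select_best(losses[in_bag])
        estimates.append(losses[out_of_bag, best].mean())
    return float(np.mean(estimates))

# Synthetic setting: 100 configurations that all have the same true
# error rate, so any apparent "winner" is produced by chance alone.
true_error = 0.3
rng = np.random.default_rng(1)
losses = (rng.random((500, 100)) < true_error).astype(float)  # 0/1 losses

naive = float(losses[:, select_best(losses)].mean())  # winner scored on same data
corrected = bootstrap_corrected_loss(losses)
print("naive estimate:    ", naive)
print("corrected estimate:", corrected)
```

In this setting the naive estimate falls below the true error rate simply because the minimum of many noisy estimates is taken, while the out-of-bag score is not used for selection within its own resample and therefore stays much closer to the true value.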