Periodic Reporting for period 4 - CausalStats (Statistics, Prediction and Causality for Large-Scale Data)
Reporting period: 2023-04-01 to 2024-03-31
Understanding cause-effect relationships between variables is of great interest in many fields of science. However, causal inference from data is much more ambitious and difficult than inferring (undirected) measures of association such as correlations, partial correlations or multivariate regression coefficients, mainly because of fundamental identifiability problems. A main objective is to exploit the advantages of large-scale heterogeneous data for causal inference, where heterogeneity arises from different experimental conditions or different unknown sub-populations. Questions about causal relations arise in many branches of science, and prioritization of causal variables is crucial for the design of experiments in the natural sciences. Statistical results and predictions should be robust against unwanted variation and perturbations. As another main objective, we show that the causality-inspired techniques developed in the project enhance such robustness and improve the degree of generalizability from one study (or dataset) to new ones, and we develop new insights into domain adaptation from a causality perspective. The new methodologies are used in, and shaped by, interdisciplinary collaborations in the bio-medical sciences.
More interpretable and reliable artificial intelligence (AI) will bring immediate benefits to society, for example in robust health-care monitoring. The current project has an immediate impact on AI systems on which our society depends.
(1) Causal regularization and corresponding distributional robustness. We achieved a major result on the duality between causal regularization, which encourages certain invariances across different sub-populations, and distributional robustness with respect to the worst-case expected loss when the test data distribution is perturbed: this is developed in Rothenhäusler et al. (2021) for linear (structural equation model) systems and extended and put into broader context with nonlinear models in a discussion paper in Statistical Science (Bühlmann, 2020a,b). The invariance idea has also been successfully transferred to the area of blind deconvolution with independent component analysis: the paper by Pfister et al. (2019) contains the novel methodology and theory and demonstrates its use for brain EEG signal processing and a deconvolution problem in climate science. The work around invariance and causality has also appeared in a perspective article in PNAS (Bühlmann, 2020c).
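To make the causal-regularization idea concrete, the following is a minimal sketch of anchor regression in the spirit of Rothenhäusler et al. (2021): least squares after a data transformation that re-weights the component of the residuals explained by exogenous "anchor" variables encoding the sub-populations. The variable names and the toy data below are illustrative, not taken from the papers.

```python
import numpy as np

def anchor_regression(X, y, A, gamma):
    """Anchor regression sketch: least squares on the transformed data
    (W X, W y) with W = I + (sqrt(gamma) - 1) P_A, where P_A is the
    orthogonal projection onto the column space of the anchors A.
    gamma = 1 recovers OLS; larger gamma enforces more invariance of the
    residuals with respect to shifts generated by the anchors."""
    P_A = A @ np.linalg.pinv(A)                        # projection onto col(A)
    W = np.eye(len(y)) + (np.sqrt(gamma) - 1.0) * P_A
    beta, *_ = np.linalg.lstsq(W @ X, W @ y, rcond=None)
    return beta

# Toy usage: a binary anchor encoding two sub-populations (environments).
rng = np.random.default_rng(0)
n = 500
A = rng.integers(0, 2, size=(n, 1)).astype(float)
X = A + rng.normal(size=(n, 1))                        # covariate shifted by the anchor
y = X[:, 0] + 2.0 * A[:, 0] + rng.normal(size=n)       # anchor also acts on y directly
print(anchor_regression(X, y, A, gamma=1.0))           # = ordinary least squares
print(anchor_regression(X, y, A, gamma=100.0))         # residuals nearly invariant across groups
```

The parameter gamma trades in-sample prediction accuracy for protection against shift perturbations in the directions spanned by the anchors, which is exactly the duality referred to above.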
(2) Deconfounding in high-dimensional problems. We achieved a strong result on how to adjust for unobserved latent and dense confounding in high-dimensional linear and nonlinear models. The technique is based on spectral transformations of the data, transforming the singular values of the matrix of covariates. The method is shown to achieve the same convergence rate for statistical accuracy as if there were no hidden confounders. The implications and potential applications are wide-ranging, essentially for any high-dimensional data analysis (e.g. in genetics, genomics or climate science). The first main methodological and mathematical contribution is published in Cevid et al. (2020). In Guo et al. (2022), we advance the method for uncertainty quantification: again, under a dense confounding assumption one can do asymptotically as well as without confounding variables. The main contribution has been put into broader context by Bühlmann and Cevid (2020).
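To illustrate the spectral idea, here is a minimal sketch of a "trim"-type transformation in the spirit of Cevid et al. (2020): the singular values of the design matrix are capped at their median, and a sparse regression is then run on the transformed data. The lasso and its tuning parameter are illustrative placeholders, not the project's tuned procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

def trim_transform(X):
    """Spectral transformation capping the singular values of X at their
    median: dense hidden confounding inflates the top singular directions,
    and shrinking them reduces its influence on the regression.
    (Intended for the high-dimensional regime with p >= n.)"""
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    tau = np.median(d)
    scale = np.minimum(d, tau) / np.maximum(d, 1e-12)  # guard against zero singular values
    return (U * scale) @ U.T                           # F with F @ X having singular values min(d, tau)

def deconfounded_lasso(X, y, alpha=0.1):
    """Sparse regression on the spectrally transformed data (F X, F y)."""
    F = trim_transform(X)
    return Lasso(alpha=alpha).fit(F @ X, F @ y)

# Usage (e.g. genomics, with many covariates):
#   fit = deconfounded_lasso(X, y); fit.coef_   # sparse estimate despite dense confounding
```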
(3) Massive computational gains in change point detection. At the heart of many problems with the analysis of heterogeneous data is the segmentation or partitioning of the data into homogeneous parts. We have developed massive computational speed-ups for such partitioning, known as change point detection. The methods, including software, are worked out in Londschien et al. (2020, 2023) and Kovacs et al. (2020, 2023).
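The flavour of the speed-up can be conveyed with a seeded-interval construction in the spirit of Kovacs et al.: instead of scanning all O(n^2) candidate intervals, a deterministic multi-scale collection of only O(n) overlapping intervals is searched with a CUSUM statistic. The sketch below (the decay factor and the Gaussian mean-shift CUSUM are standard textbook choices, not the project's code) locates a single change point.

```python
import numpy as np

def seeded_intervals(n, decay=np.sqrt(2)):
    """Deterministic multi-scale intervals: layer k holds roughly 2*decay^k
    intervals of length about n/decay^k, with start points spaced half an
    interval apart, giving O(n) intervals in total instead of O(n^2)."""
    intervals = {(0, n)}
    k = 1
    while n / decay**k >= 2:
        length = int(np.ceil(n / decay**k))
        n_k = 2 * int(np.ceil(decay**k)) - 1
        for s in np.linspace(0, n - length, n_k).astype(int):
            intervals.add((int(s), int(s) + length))
        k += 1
    return sorted(intervals)

def cusum_argmax(x, s, e):
    """Max CUSUM statistic for a mean shift in x[s:e] and its split point."""
    seg = x[s:e]
    m = len(seg)
    csum = np.cumsum(seg)
    t = np.arange(1, m)
    stat = np.sqrt(t * (m - t) / m) * np.abs(csum[:-1] / t - (csum[-1] - csum[:-1]) / (m - t))
    j = int(np.argmax(stat))
    return stat[j], s + int(t[j])

# Single change point: take the best CUSUM over all seeded intervals.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1, 300), rng.normal(1.5, 1, 200)])
stat, cp = max(cusum_argmax(x, s, e) for s, e in seeded_intervals(len(x)))
print(cp)  # estimated change point, close to 300
```

For multiple change points the same interval system is reused recursively, which is where the near-linear overall runtime comes from.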
(4) Statistical machine learning. We made contributions to double machine learning (Emmenegger and Bühlmann, 2021, 2023), and we created the Distributional Random Forests algorithm (Cevid et al., 2022) and analyzed further mathematical properties of it for statistical inference (Näf et al., 2023).
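As a flavour of the double machine learning theme, here is a generic cross-fitting sketch for a partially linear model Y = theta*D + g(X) + noise, following the well-known Chernozhukov et al. recipe rather than the specific estimators of Emmenegger and Bühlmann; the random-forest nuisance learners and the toy data are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_partially_linear(y, d, X, n_folds=5, seed=0):
    """Cross-fitted estimate of theta in Y = theta*D + g(X) + eps: regress
    out X from both Y and D on held-out folds, then regress residual on
    residual (the partialling-out estimator)."""
    y_res, d_res = np.empty_like(y), np.empty_like(d)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        ell = RandomForestRegressor(random_state=seed).fit(X[train], y[train])
        m = RandomForestRegressor(random_state=seed).fit(X[train], d[train])
        y_res[test] = y[test] - ell.predict(X[test])   # residual of Y given X
        d_res[test] = d[test] - m.predict(X[test])     # residual of D given X
    return float(d_res @ y_res / (d_res @ d_res))

# Toy check with theta = 1 and nonlinear nuisance functions.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 5))
d = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=2000)
y = 1.0 * d + np.cos(X[:, 1]) + rng.normal(scale=0.5, size=2000)
print(dml_partially_linear(y, d, X))  # approximately 1
```

Cross-fitting is the key device: fitting the nuisance functions on one fold and evaluating residuals on another removes the overfitting bias that flexible machine learning would otherwise introduce into the estimate of theta.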
(5) Goodness of fit of causal models. We pioneered new approaches for assessing the goodness of fit of linear and nonlinear causal models (Schultheiss et al., 2023; Schultheiss and Bühlmann, 2023a, 2023b, 2024).
(6) Related interdisciplinary work. Causal invariance techniques have been used in Pfister et al. (2021) and Williams et al. (2022) to study the molecular landscape of the aging mouse; deconfounding has been used in Jablonski et al. (2022) to identify cancer pathway dysregulation.
A breakthrough result has been achieved by Yuansi Chen, then a postdoc funded by the ERC grant, who "essentially" solved the KLS conjecture (Chen, 2021); see also https://www.quantamagazine.org/statistics-postdoc-tames-decades-old-geometry-problem-20210301/ This result came as a surprise and has nothing to do with the formulated goals of the project. However, it demonstrates impressively that outstanding statisticians and mathematicians are able to contribute beyond their area of original expertise!