CORDIS - EU research results

Statistics, Prediction and Causality for Large-Scale Data

Periodic Reporting for period 4 - CausalStats (Statistics, Prediction and Causality for Large-Scale Data)

Reporting period: 2023-04-01 to 2024-03-31

The project deals with causality-inspired statistical machine learning, with wide-ranging implications from interdisciplinary applications to specialized areas in statistics and machine learning. The main contributions of the project are new statistical methods, algorithms and mathematical theory on distributional robustness, stability and better generalizability of statistical and machine learning algorithms, and on causal inference in particular models. Increased robustness of statistical methods and algorithms and their interpretability in terms of causality are linked to each other. This perhaps surprising link implies that the methods and theories we developed have broad implications for highly important aspects of information extraction from potentially large-scale and big data.

Understanding cause-effect relationships between variables is of great interest in many fields of science. However, causal inference from data is much more ambitious and difficult than inferring (undirected) measures of association such as correlations, partial correlations or multivariate regression coefficients, mainly because of fundamental identifiability problems. A main objective is to exploit the advantages of large-scale heterogeneous data for causal inference, where heterogeneity arises from different experimental conditions or different unknown sub-populations. Questions about causal relations arise in many branches of science, and prioritization of causal variables is crucial for the design of experiments in the natural sciences. Statistical results and predictions should be robust against unwanted variation and perturbations. As another main objective, we show that the causality-inspired techniques developed in the project enhance such robustness and improve the degree of generalizability from one study (or dataset) to new ones, and we develop new insights about domain adaptation from a causality perspective. The new methodologies are used and shaped by interdisciplinary collaborations in the bio-medical sciences.

More interpretable and reliable artificial intelligence (AI) will bring immediate benefits to society, for example for robust health-care monitoring. The current project has an immediate impact on AI systems on which our society depends.
The following main results have been achieved; they are grouped into sub-areas.
(1) Causal regularization and corresponding distributional robustness. We achieved a major result on the duality between causal regularization, which encourages certain invariances across different sub-populations, and distributional robustness with respect to the worst-case expected loss when the test data distribution is perturbed: this is developed in Rothenhäusler et al. (2021) for linear (structural equation model) systems and extended and put into a broader context with nonlinear models in a discussion paper in Statistical Science (Bühlmann, 2020a,b). The invariance idea has also been successfully transferred to the area of blind deconvolution with independent component analysis: the paper by Pfister et al. (2019) contains the novel methodology and theory and demonstrates its use for brain EEG signal processing and a deconvolution problem in climate science. The work around invariance and causality has also appeared in a perspective article in PNAS (Bühlmann, 2020c).
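The duality can be illustrated with the anchor regression estimator of Rothenhäusler et al. (2021), which interpolates between ordinary least squares and an invariance-enforcing solution through a single transformation of the data. The following is a minimal numpy sketch under simplifying assumptions; the toy data, function name and choice of gamma are illustrative, not taken from the paper:

```python
import numpy as np

def anchor_regression(X, y, A, gamma=5.0):
    """Anchor-regression-style estimator: gamma = 1 recovers ordinary
    least squares, larger gamma up-weights the part of the data
    explained by the anchor variables A, encouraging invariance of
    residuals across the environments encoded by A."""
    # Projection onto the column space of the anchors A
    P = A @ np.linalg.pinv(A.T @ A) @ A.T
    n = X.shape[0]
    # Transform: keep the A-orthogonal part, scale the A-explained
    # part by sqrt(gamma), then run least squares on transformed data
    W = (np.eye(n) - P) + np.sqrt(gamma) * P
    beta, *_ = np.linalg.lstsq(W @ X, W @ y, rcond=None)
    return beta

# Toy example: two "environments" encoded by a binary anchor that
# shifts the covariate; the true causal coefficient is 1.5
rng = np.random.default_rng(0)
n = 200
A = rng.integers(0, 2, size=(n, 1)).astype(float)
X = 2.0 * A + rng.normal(size=(n, 1))
y = 1.5 * X[:, 0] + rng.normal(size=n)
beta = anchor_regression(X, y, A, gamma=10.0)
```

Since the toy model is correctly specified, the estimate stays close to the causal coefficient for any gamma; the protection that gamma buys shows up when test distributions are shifted along the anchor directions.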
(2) Deconfounding in high-dimensional problems. We achieved a strong result on how to adjust for unobserved latent and dense confounding in high-dimensional linear and nonlinear models. The technique is based on spectral transformations of the data, transforming the singular values of the matrix of covariates. The method is shown to achieve the same convergence rate for statistical accuracy as if there were no hidden confounders. The implications and potential applications are wide-ranging, essentially for any high-dimensional data analysis (e.g. in genetics, genomics or climate science). The first main methodological and mathematical contribution is published in Cevid et al. (2020). In Guo et al. (2022), we advance the method for uncertainty quantification: again, under a dense confounding assumption one can do asymptotically as well as without confounding variables. The main contribution has been put into a broader context by Bühlmann and Cevid (2020).
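As a hedged illustration of the spectral-transformation idea, the sketch below caps the singular values of the covariate matrix at their median, damping the strong directions in which dense hidden confounding concentrates; the exact shrinkage rule and the toy data are illustrative stand-ins, not the precise estimator of Cevid et al. (2020):

```python
import numpy as np

def trim_transform(X, y):
    """Spectral-transformation sketch: cap the singular values of X
    at their median and apply the same linear map F to X and y.
    Dense confounding mainly inflates the leading singular directions,
    so shrinking them reduces confounding bias."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    tau = np.median(s)
    d = np.minimum(s, tau) / s          # shrink factors, all <= 1
    F = U @ np.diag(d) @ U.T            # spectral transformation
    return F @ X, F @ y

# Toy high-dimensional example with one dense hidden confounder H
rng = np.random.default_rng(1)
n, p = 100, 50
H = rng.normal(size=(n, 1))                        # hidden confounder
X = rng.normal(size=(n, p)) + H @ rng.normal(size=(1, p))
y = 2.0 * X[:, 0] + 3.0 * H[:, 0] + rng.normal(size=n)
X_t, y_t = trim_transform(X, y)
```

After the transformation, any sparse regression method (e.g. the lasso) is run on (X_t, y_t) instead of (X, y).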
(3) Massive computational gains in change point detection. At the heart of many problems in the analysis of heterogeneous data is the segmentation or partitioning into homogeneous parts. We have developed massive computational speed-ups for such partitioning, known as change point detection. The different methods, including software, are worked out in Londschien et al. (2020, 2023) and Kovacs et al. (2020, 2023).
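For orientation, the classical baseline that such speed-ups improve upon can be sketched as plain binary segmentation with a CUSUM statistic. This is an illustrative reference point only, not the project's fast algorithms; the threshold and toy data are arbitrary choices:

```python
import numpy as np

def cusum_stat(x):
    """Standardized CUSUM statistic for a single change in mean,
    evaluated at every possible split point of x."""
    n = len(x)
    t = np.arange(1, n)
    left = np.cumsum(x)[:-1]            # partial sums up to each split
    total = x.sum()
    return np.abs(np.sqrt(n / (t * (n - t))) * (left - t / n * total))

def binary_segmentation(x, threshold, lo=0, found=None):
    """Plain binary segmentation: take the split maximizing the CUSUM
    statistic; if it exceeds the threshold, record it and recurse on
    the two resulting segments."""
    if found is None:
        found = []
    if len(x) < 2:
        return found
    stats = cusum_stat(x)
    k = int(np.argmax(stats))
    if stats[k] > threshold:
        found.append(lo + k + 1)        # global index of the new segment
        binary_segmentation(x[:k + 1], threshold, lo, found)
        binary_segmentation(x[k + 1:], threshold, lo + k + 1, found)
    return sorted(found)

# Toy series: mean shifts from 0 to 3 at index 100
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
cps = binary_segmentation(x, threshold=4.0)
```

Each full recursion level costs O(n) here, but the recursion depth and repeated scans are what the project's faster variants avoid.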
(4) Statistical machine learning. We made contributions to double machine learning (Emmenegger and Bühlmann, 2021, 2023) and we created the Distributional Random Forests algorithm (Cevid et al., 2022) and analyzed its further mathematical properties for statistical inference (Näf et al., 2023).
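The double machine learning recipe for a partially linear model can be sketched as cross-fitted residual-on-residual regression. In this illustrative, self-contained version a simple polynomial least-squares learner stands in for the flexible machine learning methods (e.g. random forests) that would be used in practice; the model, data and function names are ours:

```python
import numpy as np

def poly_fit_predict(X_tr, y_tr, X_te):
    """Toy nuisance learner: least squares on degree-2 features,
    standing in for a flexible ML regressor."""
    def feats(Z):
        return np.hstack([np.ones((len(Z), 1)), Z, Z ** 2])
    coef, *_ = np.linalg.lstsq(feats(X_tr), y_tr, rcond=None)
    return feats(X_te) @ coef

def double_ml_plr(y, d, X, n_splits=2, seed=0):
    """Double ML for the partially linear model y = theta*d + g(X) + eps:
    cross-fit the nuisance regressions E[y|X] and E[d|X] on held-out
    folds, then estimate theta by regressing residual on residual."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    ry, rd = np.empty_like(y), np.empty_like(d)
    for fold in np.array_split(idx, n_splits):
        train = np.setdiff1d(idx, fold)
        ry[fold] = y[fold] - poly_fit_predict(X[train], y[train], X[fold])
        rd[fold] = d[fold] - poly_fit_predict(X[train], d[train], X[fold])
    return float(rd @ ry / (rd @ rd))

# Toy example with nonlinear nuisance functions; true theta = 2
rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 3))
d = X[:, 0] ** 2 + 0.5 * rng.normal(size=n)        # treatment
y = 2.0 * d + X[:, 1] ** 2 + rng.normal(size=n)    # outcome
theta_hat = double_ml_plr(y, d, X)
```

Cross-fitting (estimating nuisances on one fold, residualizing on the other) is what makes the final estimate insensitive to slow convergence of the nuisance learners.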
(5) Goodness of fit of causal models. We pioneered new approaches for goodness of fit for linear and nonlinear causal models (Schultheiss et al. 2023; Schultheiss and Bühlmann 2023a, 2023b, 2024).
(6) Related interdisciplinary work. Causal invariance techniques have been used in Pfister et al. (2021) and Williams et al. (2022) for the molecular landscape of the aging mouse; deconfounding has been used in Jablonski et al. (2022) for identifying cancer pathway dysregulation.

A breakthrough result was achieved by Yuansi Chen, then a postdoc paid by the ERC grant, who "essentially" solved the KLS conjecture (Chen 2021); see also https://www.quantamagazine.org/statistics-postdoc-tames-decades-old-geometry-problem-20210301/ This result came as a surprise and has nothing to do with the formulated goals of the project. However, it demonstrates impressively that outstanding statisticians and mathematicians are able to contribute beyond their area of original expertise!
Our line of work has opened up a new approach to causal inference and distributional robustness, and with this a new approach to causality-inspired machine learning. We have demonstrated its possibilities and limitations in terms of foundational mathematical results as well as empowering interdisciplinary projects. A totally unexpected breakthrough was established by Yuansi Chen with his work essentially resolving the KLS conjecture in 2021, which was truly thinking beyond the state of the art!
[Figure: Causal influence diagram with target variable Y, observed covariates X, and hidden variables H]