CORDIS - EU research results

Scalable Learning for Reproducibility in High-Dimensional Biomedical Signal Processing: A Robust Data Science Framework

Periodic Reporting for period 1 - ScReeningData (Scalable Learning for Reproducibility in High-Dimensional Biomedical Signal Processing: A Robust Data Science Framework)

Reporting period: 2022-09-01 to 2025-02-28

Data science has quickly expanded the boundaries of signal processing and statistical learning beyond their accustomed domains. Powerful and complex learning architectures have evolved to distinguish relevant information from randomness, artifacts, and irrelevant data. However, statistical and machine learning currently lack computationally scalable, tractable, and robust methods for high-dimensional data. Consequently, discoveries in, for example, genomic data can be the result of coincidental findings that happen to reach statistical significance. As long as groundbreaking advances in biotechnology are not accompanied by appropriate learning frameworks, valuable efforts are spent on researching false positives.

This ERC project develops a coherent, fast, and scalable learning framework that jointly addresses the fundamental challenges of drastically reducing computational complexity, providing statistical and robustness guarantees, and quantifying reproducibility in large-scale and high-dimensional settings. The approach builds upon very recent work of the PI. The underlying concept is to repeat randomized controlled experiments that use computer-generated fake variables as negative controls to trigger an early stopping of the learning algorithms, thereby mitigating the so-called curse of dimensionality. In contrast to existing methods, the proposed methods are completely tractable and scalable to ultra-high dimensions.

The gains of developing advanced robust learning methods that are computed ultra-fast and with tight guarantees on the targeted rate of false positives are enormous: they lead to new reproducible discoveries that can be made with high statistical power. Due to the fundamental nature and the broad applicability of the proposed learning methods, the impact of this project extends far beyond the considered biomedical signal processing use cases, benefitting all scientific domains that analyze high-dimensional data.
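The early-stopping idea can be illustrated with a minimal sketch of a single randomized experiment. The function below is a conceptual toy, not the project's actual implementation: it appends random dummy variables as negative controls and ranks all candidates by absolute correlation with the response (a simple stand-in for the forward-selection solver used in practice), stopping as soon as a chosen number of dummies has entered the selection. The function name `random_experiment` and all parameters are illustrative assumptions.

```python
import numpy as np

def random_experiment(X, y, num_dummies, t_stop, rng):
    """One early-terminated random experiment (conceptual sketch).

    Appends `num_dummies` computer-generated dummy variables to X and
    walks through all columns in order of decreasing absolute
    correlation with y. The walk stops as soon as `t_stop` dummies
    have entered, and the original variables selected up to that
    point are returned.
    """
    n, p = X.shape
    dummies = rng.standard_normal((n, num_dummies))  # negative controls
    X_aug = np.hstack([X, dummies])
    # Rank all columns by |correlation| with y (stand-in for a
    # forward-selection / LARS-type solver).
    order = np.argsort(np.abs(X_aug.T @ y))[::-1]
    selected, dummies_hit = [], 0
    for j in order:
        if j >= p:                 # a dummy entered the model
            dummies_hit += 1
            if dummies_hit >= t_stop:
                break              # early termination
        else:
            selected.append(j)
    return selected
```

Because the dummies are known to be null, the point at which they start entering the model signals that the informative variables have likely been exhausted, so the search can stop long before exploring the full high-dimensional space.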
A new learning framework for FDR-controlled high-dimensional data analysis has been developed and specified for various statistical models (regression, dependent variables, grouped variables, Gaussian graphical models, principal component analysis). The developed Terminating-Random Experiments (T-Rex) selector controls a user-defined target FDR while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early terminated random experiments, which are conducted on a combination of the candidate variables and multiple sets of randomly generated dummy variables.

A versatile FDR control theory has been developed that allows for finite-sample proofs, which is essential for high-dimensional data. The developed proof strategy, which is based on martingale theory, is not limited to FDR control: the strategy of deriving finite-sample error bounds by injecting dummies and mathematically modelling their behaviour shall be extended to other metrics and will also form the basis of a new high-dimensional robustness theory.

Two open-source software packages have been published on CRAN, each with more than 12,000 downloads (as of September 2024). By implementing advanced C++ functionalities (e.g. memory mapping), they enable FDR-controlled variable selection with up to 5 million variables on a laptop. Two algorithms for biomarker extraction in cardiovascular signals have been integrated into the popular Python package neurokit2. Multiple real-data biomedical use cases have been addressed, such as genome-wide association studies, human immunodeficiency virus type-1 (HIV-1) drug resistance analysis, breast cancer survival analysis, calcium imaging, and cardiovascular signal analysis. Multiple new interdisciplinary collaborations have been established during this project.
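The fusion step described above can be sketched as a simple vote across experiments. The snippet below is an illustrative assumption, not the exact T-Rex calibration: each experiment returns a set of selected variable indices, and a variable is kept only if it appears in more than a fixed fraction of the runs. In the actual framework, the number of dummies and the voting level are calibrated so that the user-defined target FDR is provably met; here the threshold is fixed for illustration, and `fuse_experiments` is a hypothetical name.

```python
import numpy as np

def fuse_experiments(selections, p, k, vote_threshold=0.5):
    """Fuse K early-terminated experiments by relative occurrence.

    `selections` is a list of K index lists (one per random
    experiment over p candidate variables). A variable is kept if
    it was selected in more than `vote_threshold` of the K runs.
    """
    votes = np.zeros(p)
    for sel in selections:
        votes[list(sel)] += 1          # count appearances per variable
    rel_occurrence = votes / k         # fraction of runs selecting each
    return np.flatnonzero(rel_occurrence > vote_threshold)
```

Variables that are only picked up by chance in individual runs rarely survive the vote, while consistently selected variables do, which is what makes the fused selection stable across the injected randomness.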
For the first time, the developed T-Rex selector framework allowed us to scale false discovery rate (FDR)-controlled machine learning to settings with millions of variables. FDR-controlled genome-wide association studies (GWAS) on a laptop are now possible. This is a significant achievement and can be considered a breakthrough in high-dimensional learning. Finite-sample proofs have been derived showing FDR control for high-dimensional data in settings where this was not possible prior to our work.

Beyond regression problems, we also developed a method with false edge rate control for high-dimensional Gaussian graphical models and adapted the T-Rex so that we can perform sparse T-Rex principal component analysis (PCA). This had not been planned in the original work plan, and it opens the door to many applications requiring sparse factor models.

Beyond the biomedical use cases, which are the main focus of this project, reliable FDR-controlled index tracking has been enabled, so that we can now reliably and automatically track non-stationary time series (such as the S&P 500 index) over decades using few features. This opens the door to, e.g. selecting green, climate-friendly stocks while statistically controlling the risk. Very recently, we have expanded our methods to high-dimensional complex-valued data, which opens the door to many engineering applications, such as multi-source detection, localization, and direction-of-arrival estimation.