Statistical Inference from Multiscale Biological Data: theory, algorithms, applications

Project information

SIMBAD

Grant agreement ID: 101131463

DOI

10.3030/101131463

EC signature date: 20 October 2023

Start date: 1 December 2023

End date: 30 November 2027

Funded under

Marie Skłodowska-Curie Actions (MSCA)

Total cost

No data

EU contribution

€ 740,600.00

Coordinated by

POLITECNICO DI TORINO
Italy

Periodic Reporting for period 1 - SIMBAD (Statistical Inference from Multiscale Biological Data: theory, algorithms, applications)

Reporting period: 2023-12-01 to 2025-11-30

High-throughput experimental techniques have transformed the life sciences, enabling quantitative investigation of biological systems across scales, from molecules and cells to patients and populations. These technologies generate massive, high-dimensional datasets, including protein sequence families, spatial maps of cellular microenvironments, heterogeneous clinical records, and large-scale epidemiological indicators. Beyond mechanistic insights, such data allow inference of previously unknown quantitative laws and organizational principles. This motivates inverse modelling: instead of building mechanistic models, statistical inference reconstructs the underlying probability distributions generating the data. These generative models support prediction, classification, and design—for example, predicting mutational effects, protein evolution, intercellular networks, disease progression, and epidemic dynamics. By focusing on essential features, inverse models are more generalisable than detailed, context-dependent models.

Reliable inference faces challenges due to high dimensionality, multiple scales, and sparse sampling: (i) strong, evolution-shaped heterogeneity; (ii) model selection under computational constraints; (iii) overfitting in undersampled regimes; and (iv) limited data annotation, requiring semi-supervised methods. SIMBAD addresses these issues by developing statistical-physics-based frameworks for generative modelling of heterogeneous, high-dimensional data. It devises inference algorithms robust to undersampling and overfitting and applies them to four domains: protein evolution and design, cellular metabolic state inference, digital contact tracing and epidemiological networks, and survival analysis in medicine. Although the applications are diverse, they share the same theoretical bottlenecks, so methods that resolve them are broadly applicable and high-impact.

In Period 1, SIMBAD progressed well, in line with its objectives, timeline, and milestones. All Work Packages (WPs) are advancing; minor delays caused by data access and experimental constraints have been mitigated by proactive actions.

WP1 (Innovative Inference Algorithms) is about 40% complete. It has established core theory for inference without overfitting bias and applied it to molecular and directed network inference, laying foundations for modelling biological heterogeneity and producing publications.

WP2 (Protein Sequence Landscapes) integrates theory and experiment: it has produced software for unsupervised generative modelling, advanced semi-supervised learning, and initiated mutational scanning work. Forecasting of protein and pathogen evolution is well advanced, with multiple publications.

WP3 (Metabolic Networks) leveraged an early theoretical breakthrough to accelerate progress: genome-scale inference is nearly complete, phenotypic landscape mapping has produced publications, and work on persistent and drug-resistant states is upcoming.

WP4 (Multi-Scale Epidemic Risk) is slightly behind schedule because of delayed data, but methods for network inference, local risk prediction, and overfitting correction are progressing, and mobility data integration is resuming through new collaborations.

WP5 (Complex Medical Data) is about 70% complete, with the overfitting theory nearing validation and Bayesian Federated Inference for rare diseases advancing strongly, including applications to cancer data.

Overall, SIMBAD is delivering high-impact results in inference theory and applications across molecular biology, metabolism, epidemiology, and medicine, supported by strong publications and cross-WP integration.

SIMBAD has substantially advanced the state of the art by introducing theoretical and computational frameworks to infer, control, and predict complex biological systems under heterogeneity and undersampling. An early breakthrough in WP3 enabled inference in simplified yet rigorous metabolic frameworks and delivered software (D8) for high-throughput analysis of heterogeneous cell populations. This software provides unprecedented capabilities to quantify and interpret metabolic fluxes at single-cell resolution.

High-impact publications illustrate these advances. Communications Physics presented metabolic coordination and phase transitions in multicellular systems, revealing collective behaviour beyond classical network models. Science Advances demonstrated cross-feeding percolation transitions in intercellular metabolic networks, linking metabolism, topology, and critical phenomena. Nucleic Acids Research showed that periodic signalling outperforms constant regulation in microRNA-mediated control, providing new insights into regulatory efficiency.

These results move inference from descriptive modelling to predictive, controllable, and generative frameworks, showing that major obstacles—heterogeneity, undersampling, overfitting, and lack of annotation—are solvable. Advanced, statistical-mechanics-based inference can transform complex datasets into predictive and interpretable models, informing real-world decisions rather than post-hoc analysis. This is directly relevant to EU priorities in health, digital transformation, and preparedness. In areas such as epidemic risk assessment, personalized medicine, biotechnology, and public health surveillance, reliable inference enables effective prevention, intervention, and resource allocation.
