Periodic Reporting for period 1 - SIMBAD (Statistical Inference from Multiscale Biological Data: theory, algorithms, applications)
Reporting period: 2023-12-01 to 2025-11-30
Reliable inference from biological data is hampered by high dimensionality, multiple scales, and sparse sampling. Four challenges stand out: (i) strong, evolution-shaped heterogeneity; (ii) model selection under computational constraints; (iii) overfitting in undersampled regimes; and (iv) limited data annotation, which calls for semi-supervised methods. SIMBAD addresses these issues by developing statistical-physics-based frameworks for generative modeling of heterogeneous, high-dimensional data. It devises inference algorithms that are robust to undersampling and overfitting and applies them to four domains: protein evolution and design, cellular metabolic state inference, digital contact tracing and epidemiological networks, and survival analysis in medicine. Although the applications are diverse, they share the same theoretical bottlenecks, so the methods developed are broadly applicable and can have high impact.
WP1 (Innovative Inference Algorithms) is ~40% complete. It has established core theory for inference without overfitting bias and applied it to molecular and directed network inference, laying foundations for modeling biological heterogeneity and generating publications.
WP2 (Protein Sequence Landscapes) integrates theory and experiment. It has produced software for unsupervised generative modeling, advanced semi-supervised learning, and initiated mutational scanning work. Forecasting of protein and pathogen evolution has also advanced, with multiple publications.
WP3 (Metabolic Networks) leveraged an early theoretical breakthrough to accelerate progress. Genome-scale inference is nearly complete, phenotypic landscape mapping has produced publications, and work on persistent and drug-resistant states is upcoming.
WP4 (Multi-Scale Epidemic Risk) is slightly behind schedule because of delayed data, but methods for network inference, local risk prediction, and overfitting correction are progressing, and mobility data integration is resuming through new collaborations.
WP5 (Complex Medical Data) is ~70% complete. The overfitting theory is nearing validation, and Bayesian Federated Inference for rare diseases is advancing strongly, including applications to cancer data.
Overall, SIMBAD is delivering high-impact results in inference theory and applications across molecular biology, metabolism, epidemiology, and medicine, supported by strong publications and cross-WP integration.
High-impact publications illustrate these advances. Communications Physics presented metabolic coordination and phase transitions in multicellular systems, revealing collective behaviour beyond classical network models. Science Advances demonstrated cross-feeding percolation transitions in intercellular metabolic networks, linking metabolism, topology, and critical phenomena. Nucleic Acids Research showed that periodic signalling outperforms constant regulation in microRNA-mediated control, providing new insights into regulatory efficiency.
These results move inference from descriptive modelling to predictive, controllable, and generative frameworks, showing that the major obstacles (heterogeneity, undersampling, overfitting, and lack of annotation) are solvable. Advanced, statistical-mechanics-based inference can transform complex datasets into predictive and interpretable models that inform real-world decisions rather than serving only post-hoc analysis. This is directly relevant to EU priorities in health, digital transformation, and preparedness. In areas such as epidemic risk assessment, personalized medicine, biotechnology, and public health surveillance, reliable inference enables effective prevention, intervention, and resource allocation.