High-throughput experimental techniques have transformed the life sciences, enabling quantitative investigation of biological systems across scales, from molecules and cells to patients and populations. These technologies generate massive, high-dimensional datasets, including protein sequence families, spatial maps of cellular microenvironments, heterogeneous clinical records, and large-scale epidemiological indicators. Beyond yielding mechanistic insights, such data make it possible to infer previously unknown quantitative laws and organisational principles. This motivates inverse modelling: rather than building mechanistic models, one uses statistical inference to reconstruct the underlying probability distributions that generate the data. The resulting generative models support prediction, classification, and design, for example predicting mutational effects, protein evolution, intercellular networks, disease progression, and epidemic dynamics. Because they focus on essential features, inverse models can generalise better than detailed, context-dependent models.
Reliable inference is challenging because the data are high-dimensional, span multiple scales, and are sparsely sampled. Four difficulties stand out: (i) strong heterogeneity shaped by evolution; (ii) model selection under computational constraints; (iii) overfitting in undersampled regimes; and (iv) limited data annotation, which calls for semi-supervised methods. SIMBAD addresses these issues by developing statistical-physics-based frameworks for generative modelling of heterogeneous, high-dimensional data. It devises inference algorithms that are robust to undersampling and overfitting and applies them to four domains: protein evolution and design, inference of cellular metabolic states, digital contact tracing and epidemiological networks, and survival analysis in medicine. Despite their diversity, these applications share the same theoretical bottlenecks, so the methods developed can be broadly applicable and high-impact.