Machine learning for computational science:
statistical and formal modelling of biological systems

Final Report Summary - MLCS (Machine learning for computational science: statistical and formal modelling of biological systems)

Modern biology is increasingly shaped by computational thought and practice. As in many sciences, novel technologies are enabling data collection on an ever larger scale. Such data has the potential to inform and constrain predictive mathematical models of biological systems, yet its scale, complexity and noise characteristics demand sophisticated analysis methodologies, creating a growing role for statistical machine learning within biological investigations. At the same time, a deep analogy between the functioning of biological and computational systems has led to the growing development of formal computational techniques within biology, posing a novel set of challenges to theoretical computer science. The MLCS project was conceived to study, advance and shape this convergence of two major areas of computer science.

The project outcomes can be broadly grouped along three lines: foundational, statistical and application-oriented. Among the foundational outcomes, we pioneered the use of machine learning techniques such as Gaussian processes within formal modelling techniques such as model checking and model synthesis, in a number of papers written in collaboration with Prof Bortolussi (Trieste) and Prof Bartocci (Vienna). In collaboration with Prof Hillston (Edinburgh), we developed ProPPA, the first process algebra within the probabilistic programming paradigm, thus providing an early example of a formal programming language that incorporates machine learning.

At the statistical level, we developed novel methodologies for approximate solution and inference in stochastic processes. Highlights of this activity included a novel algorithm for Bayesian inference in continuous-time Markov chains (developed in collaboration with Prof Hillston), which uses random state-space truncation to achieve substantial increases in computational efficiency. In a separate collaboration with Dr Grima (Edinburgh), we proved an approximate equivalence between stochastic reaction-diffusion processes and a statistically tractable class of spatio-temporal processes, thus obtaining the first practical methodology for parameter inference in this popular class of mechanistic spatio-temporal models.
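
To give a flavour of what state-space truncation means for a continuous-time Markov chain, the sketch below computes the transient distribution of a simple birth-death process restricted to the states 0..n_max, using uniformization. It is only an illustration under stated assumptions: the truncation level here is fixed rather than random, so this is not the project's algorithm, and the model, the function names and the parameter values are all hypothetical choices made for this example.

```python
import math


def uniformized_step(p, birth, death, n_max, lam):
    """One step of the uniformized chain: returns p @ (I + Q / lam).

    Q is the generator of a birth-death chain truncated to 0..n_max
    (illustrative model: constant birth rate, death rate proportional to n).
    """
    new = [0.0] * (n_max + 1)
    for n, mass in enumerate(p):
        up = birth if n < n_max else 0.0   # births are blocked at the truncation level
        down = death * n
        new[n] += mass * (1.0 - (up + down) / lam)
        if n < n_max:
            new[n + 1] += mass * up / lam
        if n > 0:
            new[n - 1] += mass * down / lam
    return new


def transient_distribution(birth, death, n_max, p0, t, tol=1e-12, max_terms=100000):
    """Distribution at time t on the truncated state space.

    Uniformization: p(t) = sum_k Poisson(k; lam * t) * p0 @ P^k,
    where the series is stopped once the Poisson weights sum to 1 - tol.
    """
    lam = birth + death * n_max            # upper bound on every exit rate
    weight = math.exp(-lam * t)            # Poisson weight for k = 0
    term = list(p0)
    result = [weight * x for x in term]
    accumulated = weight
    for k in range(1, max_terms):
        if 1.0 - accumulated <= tol:
            break
        term = uniformized_step(term, birth, death, n_max, lam)
        weight *= lam * t / k
        result = [r + weight * x for r, x in zip(result, term)]
        accumulated += weight
    return result


if __name__ == "__main__":
    n_max = 60                             # fixed truncation level (assumption)
    p0 = [1.0] + [0.0] * n_max             # start with zero molecules
    p = transient_distribution(10.0, 1.0, n_max, p0, t=1.0)
    mean = sum(n * mass for n, mass in enumerate(p))
    print("mean copy number at t = 1:", mean)
```

For this particular model started from zero, the untruncated process has a Poisson distribution with mean (birth/death)(1 - exp(-death t)), which gives a convenient check that the truncation level is high enough. A fixed truncation always introduces some bias, however small; drawing the truncation level at random is one way of trading that bias for variance.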

At the application level, we produced several novel methods for biological data analysis, together with associated software tools that were released as open source in either R (through the Bioconductor portal) or Python (via GitHub). This activity resulted in several high-profile methodological papers in journals such as Nature Methods, Bioinformatics and Genome Biology, as well as collaborative papers in top journals such as Science, PNAS and Nature Communications. We also distributed software tools implementing the methodological developments of the more foundational work packages, including an open-source Python implementation of ProPPA and U-check, a Java suite of tools for model checking and system design.