Scalable inference algorithms for Bayesian evolutionary epidemiology

Periodic Reporting for period 3 - SCARABEE (Scalable inference algorithms for Bayesian evolutionary epidemiology)

Reporting period: 2020-08-01 to 2022-01-31

Advances in sequencing technologies are providing an unprecedented opportunity to discover in detail the mechanisms involved in the evolution and spread of microbes that cause human infectious disease. At the same time, developers of statistical methods face an enormous challenge in coping with the wealth of data this opportunity brings. The rise of microbial Big Data promises a giant leap in causal discovery. However, existing statistical methods are neither able to cope with the size and complexity of the emerging data sets nor designed to answer the novel biological questions they enable. To fulfil this promise, SCARABEE will leverage scalable inference methods through a unique combination of machine learning algorithms and Bayesian statistical models for evolutionary epidemiology. We focus on central biological questions about adaptation, epistasis, genome evolution and transmission of microbes causing infectious disease. Big Data combined with the novel inference methods will make it possible to answer a multitude of important questions that currently remain intractable. Through our close collaboration with leading research centres in infectious disease epidemiology and genomics, we expect the SCARABEE project to considerably advance understanding of the evolution and transmission of numerous pathogens that pose a major threat to human health, which will be important for reducing their disease burden in the future. Large-scale biological data will be used to benchmark the developed methods, which will be made publicly available as free software packages to benefit the wide community of microbiologists and infectious disease epidemiologists.
We have consolidated our position as one of the leading developers of statistical methods for bacterial population genomics and advanced the field significantly during the SCARABEE lifespan. SCARABEE focuses on three main inference challenges for driving biological discoveries and quantification of models derived using biological theory: 1) mega-scale bacterial population genomic and phenotypic association models, 2) genome-wide epistatic models, 3) computationally intractable evolutionary and transmission dynamic models. For challenge #1, we have developed methods that made notable progress and now represent the state-of-the-art. First, the fastbaps software is the only model-based method for estimating population structure from genome-wide data that scales to at least hundreds of thousands of genomes. Its current algorithmic architecture enables analyses of data with up to 1 million genomes and 1 million observed SNPs, going two orders of magnitude beyond the previous methods available for haploid genome data. Second, pyseer is the leading bacterial GWAS method, building upon the earlier success of the SEER software. The first version of pyseer introduced the first comprehensive array of statistical models for performing GWAS in large-scale genome collections and allowed considerable extensibility. The second version of pyseer introduced a significant leap in computational scalability and, for the first time, the possibility of using a directly biologically interpretable machine learning approach: pangenome-spanning penalized multiple regression models that encompass all the relevant population variation simultaneously. Both the simulations in this publication and independent work by others demonstrate that it is currently the most accurate method for bacterial GWAS.
For challenge #2, we have developed two methods that made notable progress on genome-scale epistasis analysis and currently represent the state-of-the-art of this field: SuperDCA and Genome-wide epistasis and co-selection study (GWES) using mutual information. The latter software has been downloaded over 1,000 times from GitHub since its publication in 2019. Using this method we have been able to identify, from large-scale population genomic data sets, highly probable and previously unrecognized candidates driving the success of multiple major human pathogen species and lineages. These genomic discoveries are currently being validated experimentally and we expect them to be published in leading biological journals in the near future. For challenge #3, we have developed ELFI, a general software platform for likelihood-free inference for simulator-based models, which was published in a leading machine learning journal. Further development of ELFI is currently coordinated under the SCARABEE project and we are in the process of introducing several new inference methods that will greatly enhance its usefulness for applications in evolutionary epidemiology and transmission analysis. Some notable applications of the accelerated inference techniques developed by the Corander group and implemented in ELFI are the discoveries related to negative frequency-dependent selection (NFDS) acting on accessory genomic loci as a dominating force for population evolution in both Streptococcus pneumoniae and Escherichia coli: 1) Jukka Corander, Christophe Fraser, Michael U. Gutmann, Brian Arnold, William P. Hanage, Stephen D. Bentley, Marc Lipsitch, Nicholas J. Croucher (2017). Frequency-dependent selection in vaccine-associated pneumococcal population dynamics. Nature Ecology & Evolution, DOI: 10.1038/s41559-017-0337-x 2) Alan McNally, Teemu Kallonen, Christopher Connor, Khalil Abudahab, David M. Aanensen, Carolyne Horner, Sharon J. Peacock, Julian Parkhill, Nicholas J. Croucher, Jukka Corander (2019). Diversification of colonisation factors in a multidrug-resistant Escherichia coli lineage evolving under negative frequency-dependent selection. mBio, DOI: 10.1128/mBio.00644-19 3) Caroline Colijn, Jukka Corander, Nicholas J. Croucher (2020). Designing ecologically-optimised pneumococcal vaccines using population genomics. Nature Microbiology, DOI: 10.1038/s41564-019-0651-y
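As a rough illustration of what likelihood-free inference for a simulator-based model means in challenge #3, the sketch below implements plain rejection sampling (approximate Bayesian computation) in standard-library Python. It is not ELFI's API and not one of the accelerated methods developed in the project. The toy Gaussian simulator, the sample-mean summary statistic, the uniform prior and the names `simulate`, `summary` and `abc_rejection` are all assumptions made for this example.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Hypothetical toy simulator: n draws from a normal with mean theta, unit variance."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(data):
    """Summary statistic used to compare simulated and observed data (sample mean)."""
    return statistics.fmean(data)


def abc_rejection(observed, prior_low, prior_high, tolerance, n_draws, seed=0):
    """Keep prior draws whose simulated summary lies within `tolerance` of the observed one.

    The likelihood is never evaluated; only forward simulation is needed.
    """
    rng = random.Random(seed)
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(prior_low, prior_high)       # draw from the uniform prior
        simulated = simulate(theta, len(observed), rng)  # run the simulator forward
        if abs(summary(simulated) - s_obs) <= tolerance:
            accepted.append(theta)                       # approximate posterior sample
    return accepted


# Synthetic "observed" data generated with a known parameter value (assumed: 2.0).
observed = simulate(2.0, 50, random.Random(42))
posterior_sample = abc_rejection(observed, -5.0, 5.0, tolerance=0.1, n_draws=20000, seed=1)
print(len(posterior_sample), statistics.fmean(posterior_sample))
```

Plain rejection sampling discards most simulations, which is why realistic evolutionary and transmission simulators call for more sample-efficient inference methods of the kind the text describes.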
For all three challenges listed above, we have developed methods that made notable progress and went significantly beyond the state-of-the-art. We expect this trend to continue for every challenge until the end of the project. Significant biological results have been and will continue to be delivered using the SCARABEE methods.