Periodic Reporting for period 4 - SCARABEE (Scalable inference algorithms for Bayesian evolutionary epidemiology)
Reporting period: 2022-02-01 to 2022-07-31
Advances in sequencing technologies are providing an unprecedented opportunity to uncover in detail the mechanisms involved in the evolution and spread of microbes that cause human infectious disease, and in particular to elucidate the success factors behind multidrug-resistant bacteria. At the same time, developers of statistical methods have faced an enormous challenge in coping with the wealth of data this opportunity brings. The rise of microbial Big Data promises a giant leap in important discoveries; however, the previously existing statistical methods were neither able to cope with the size and complexity of the emerging data sets nor designed to answer the novel biological questions they enable. To fulfil this promise, SCARABEE aimed to leverage scalable inference methods through a unique combination of machine learning algorithms and statistical models for evolutionary epidemiology driven by population genomics. We focused on central biological questions about adaptation, epistasis, genome evolution and transmission of microbes causing infectious disease. Big Data combined with the novel inference methods made it possible to answer a multitude of important questions that were previously intractable or very challenging to solve reliably. Through close collaboration with leading research centres in infectious disease epidemiology and genomics, the SCARABEE project aimed to considerably advance understanding of the evolution and transmission of numerous pathogens that pose a major threat to human health, which will be important for reducing their disease burden in the future.
Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far
We developed the most scalable and powerful methods for analyzing bacterial population structure and performing genome-wide association studies (GWAS). GWAS is a widely used approach for discovering the genetic architecture of traits in any living organism; it operates by associating measured variation in phenotypes with variation in genotypes, using large population samples and statistical models. We introduced the first modular software platform for bacterial GWAS (pyseer) and later a new machine-learning-based approach to GWAS, which increased statistical power significantly beyond the previous state of the art. The population structure analysis methods developed by SCARABEE scale to a million whole genomes and beyond. We further established the concept of the genome-wide epistasis and co-selection study (GWES), which complements GWAS by allowing genetic architectures to be discovered from traces of selection without direct access to phenotypic measurements. Finally, we introduced and established ELFI (http://elfi.ai) as the leading software package for likelihood-free inference with interpretable simulator-based models. By making all SCARABEE methods available as open-source software, we ensured maximal opportunities for exploitation, dissemination and further development.
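The association step that underlies GWAS can be illustrated with a minimal sketch. It is not the pyseer implementation or any SCARABEE method: it assumes binary presence/absence genotypes and a binary phenotype, tests each variant with a plain Pearson chi-square test on a 2x2 table, and ignores the correction for population structure that any real bacterial GWAS requires. All names (`association_scan`, `variant_A`, `variant_B`) and the toy data are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the variant carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def association_scan(genotypes, phenotypes):
    """Test every variant for association with a binary phenotype.

    genotypes: dict mapping variant name -> list of 0/1 calls, one per sample
    phenotypes: list of 0/1 labels, one per sample
    Returns a dict mapping variant name -> (statistic, p-value).
    """
    results = {}
    for variant, calls in genotypes.items():
        pairs = list(zip(calls, phenotypes))
        a = sum(1 for g, y in pairs if g == 1 and y == 1)
        b = sum(1 for g, y in pairs if g == 1 and y == 0)
        c = sum(1 for g, y in pairs if g == 0 and y == 1)
        d = sum(1 for g, y in pairs if g == 0 and y == 0)
        stat = chi2_2x2(a, b, c, d)
        results[variant] = (stat, p_value_1df(stat))
    return results


# Toy data (hypothetical): 8 samples, e.g. 1 = resistant, 0 = susceptible.
phenotypes = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = {
    "variant_A": [1, 1, 1, 1, 0, 0, 0, 0],  # tracks the phenotype exactly
    "variant_B": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the phenotype
}

results = association_scan(genotypes, phenotypes)

# Bonferroni threshold for the number of variants tested.
threshold = 0.05 / len(genotypes)
hits = [v for v, (_, p) in results.items() if p < threshold]

for variant, (stat, p) in results.items():
    print(f"{variant}: chi2={stat:.2f}, p={p:.4f}")
print("significant:", hits)
```

In this toy example only the variant that tracks the phenotype passes the Bonferroni threshold. With real bacterial data, strong clonal population structure produces many spurious associations under such a naive test, which is why dedicated methods model the population structure explicitly.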
Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)
SCARABEE went significantly beyond the state of the art along all of its main research axes. By combining the novel methods with the latest population genomic data, the project significantly advanced understanding of the evolution, adaptation and transmission of multidrug-resistant pathogenic bacteria. In summary, the methods and data generated during the lifespan of SCARABEE have set the next frontier for bacterial population genomics research.