
Exploring the human gut microbiome at strain resolution

Periodic Reporting for period 4 - MicrobioS (Exploring the human gut microbiome at strain resolution)

Reporting period: 2021-01-01 to 2021-12-31

With the genome sequencing of hundreds of bacterial isolates per day and a vast and growing number of metagenomics sequencing projects on gut microbiomes in healthy and diseased people all over the world, it becomes feasible to explore the microbial diversity in us not only at the level of genera and species but also at the level of strains. As two different strains of a prokaryotic species might share only 40% of their genes and can also differ vastly in single nucleotide variation (SNV), many aspects of the microbial communities we host in the gut might only be revealed at this high level of resolution. This proposal aims (i) to develop a robust methodology to characterize the SNV and gene content landscape from metagenomic shotgun data, (ii) to explore patterns of variation in the human population, stratified geographically but also by subpopulations such as families, and to understand the dispersal and evolution of microbial strains, and (iii) to work towards medical applications, for example by monitoring fecal microbiota transplantation (FMT) at strain resolution or monitoring particular strains of interest in the population.
We were able to achieve almost all of the envisioned tools and applications of the proposal. Regarding tool development, we have published a single nucleotide variation analyser, metaSNV (Costea et al., PLoS ONE 2017, Van Rossum et al., Bioinformatics 2022), improved quality control for metagenome-assembled genomes to distinguish strains (Orakov et al., Genome Biol. 2021), devised methods for gene content analyses of conspecific strains as well as subspecies delineation protocols (Costea et al., Mol. Sys. Biol. 2017, Van Rossum et al., Bioinformatics 2022) and developed many other analysis tools and resources that are required to zoom in to strain resolution (e.g. the bacterial genome resource proGenomes: Mende et al., NAR 2017 and NAR 2019; an update of our eggNOG (meta)genome annotation resource: Huerta-Cepas et al., Mol Biol Evol. 2017 and NAR 2019; and a tool to annotate newly sequenced (meta)genomes: Huerta-Cepas et al., Mol Biol Evol. 2017 and Cantalapiedra et al., Mol Biol Evol. 2021). Regarding the identification of known strains in metagenomics data, we have developed corresponding pipelines that led, as proposed, to various biological discoveries. For example, we have, as intended, identified and quantified subspecies in the vast majority of abundant and prevalent human gut microbes; these subspecies have a distinct biogeography, seem to be exclusive and stay with the host for a long time (Costea et al., Mol. Sys. Biol. 2017, Van Rossum et al., Bioinformatics 2022, Van Rossum et al., Nature Rev. Microbiol. 2020). We pointed to the need for and power of strain-level resolution in metagenomics (Schmidt et al., Cell 2018) and illustrated it with various findings, e.g. basic ones such as extensive strain transmission along the gastrointestinal tract (Schmidt et al., eLife 2019), findings in relation to medication (Forslund et al., Nature 2021, Hildebrand et al., Gut 2019), and findings in relation to diseases (Parkinson’s disease: Bedarf et al., Genome Med. 2017; pancreatic cancer: Kartal et al., Gut 2022).
Furthermore, as intended, we analysed longitudinal data using SNV markers in medically relevant areas such as FMT (Li et al., Science 2016; Schmidt et al., bioRxiv 2021, Nat. Med. in revision) or in medically important species such as C. difficile (Ferretti et al., bioRxiv 2022, Nature Med., submitted) and explored strain exchange within families over time in a biogeographic context (Korpela et al., Genome Res. 2018, Hildebrand et al., Cell Host Microbe 2021). The latter also enabled us to progress on the population genetics part, in that we discovered different resistance and dispersal strategies (Hildebrand et al., Cell Host Microbe 2021).
We demonstrated that exploring metagenomics data below the species level is very powerful: it can yield basic knowledge and can also be utilized in clinical applications. This applies to cross-sectional as well as longitudinal data of healthy and diseased individuals, which help to understand strain influxes and mutational processes, including gene content changes. We quantified strain transmission between individuals and body sites and have shown that known pathogenic strains can be reliably identified above a technically required abundance level. We derived a reliable operational strain definition that fits the data we observed, to enable robust microbiome research projects at strain level.