CORDIS - EU research results
Content archived on 2024-06-18

Integrating Pattern and Process to Reconstruct the Phylogeny of Genomes

Final Report Summary - GENEFOREST (Integrating pattern and process to reconstruct the phylogeny of genomes)

The foundations of modern evolutionary biology rest on two central ideas:

(i) all species are related to each other through a history of common descent; and
(ii) the delicate match between a species and its environment is explained by natural selection.

These two ideas embody the duality of pattern and process that is a fundamental feature of the evolution of life on earth: the unique historic pattern of descent and the interplay of population genetic and genomic processes that have generated it.

The GENEFOREST project aimed to develop and apply phylogenetic methods capable of reconstructing the historic pattern of genome evolution across a broad phylogenetic range encompassing the three domains of life. The goal was to reconstruct the prevailing trend of genomic descent (the tree of genomes) while also inferring the genomic processes that relate it to the histories of its constituent parts (the forest of gene trees). The motivation for constructing such models is that explicitly considering these processes, i.e. the history of gene transfers, duplications and losses, permits the use of complete genomes instead of a handful of genes. GENEFOREST proposed to apply the methods developed to large-scale reconstructions, enabling the development of a valuable database of gene transfer, duplication and loss events that can benefit a wide range of biological research.

During the first year of GENEFOREST, we developed the ODT model, an explicit probabilistic model of duplication, transfer and loss (DTL) that views the phylogenetic histories of homologous gene families (i.e. gene trees) as independent samples generated by the processes of DTL. We used the ODT model to reconstruct the dated phylogeny of 36 cyanobacterial species based on over 8000 gene families. Aside from providing an accurate picture of the timing and pattern of speciations among cyanobacteria, our results also offered a glimpse of the evolution of gene content over billions of years of evolutionary time. Our research was published in the Proceedings of the National Academy of Sciences of the United States of America.

At the same time, in collaboration with Bastien Boussau, we also developed PHYLDOG, the first probabilistic method to simultaneously infer the species tree and the thousands of gene phylogenies that together constitute the history of genomes. Exploiting the power of parallel computing, and using one of the largest supercomputers in the world, we reconstructed the evolutionary history of 36 mammalian genomes. Our results appeared in Genome Research.

During the second and final year of GENEFOREST, we developed an extension of the ODT model, which we named exODT. This is the first model of horizontal gene transfer that correctly treats gene transfers that involve speciation to, and evolution along, extinct or unsampled lineages. This is an important contribution because, as we demonstrate in the publication describing exODT, the overwhelming majority of transfers in fact involve evolution along extinct or unsampled lineages. Our results appeared in Systematic Biology.

To achieve GENEFOREST's goal of developing models that are applicable to large-scale datasets, we combined the exODT model with conditional clade probabilities to derive the amalgamated likelihood estimation (ALE) model, which efficiently infers improved gene trees given a species tree. Implemented in a parallel computing framework, the ALE approach makes joint inference of gene trees and species trees feasible for up to 100 genomes. Furthermore, simultaneous reconstruction provides not only a single species tree but also an ensemble of improved gene trees annotated with gene transfer, duplication and loss events that together render the historically unique tree of life as the emergent result of stochastic genome evolution processes. Our results are to appear in Systematic Biology.
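
As an illustration of the conditional clade probabilities mentioned above, the following minimal Python sketch estimates the probability of a rooted binary tree from a sample of trees by multiplying, over internal nodes, the observed frequency of each split given its parent clade. It is not the ALE implementation and omits the reconciliation with the species tree entirely; the nested-tuple tree representation, the function names and the six-leaf toy sample are assumptions made only for this example.

```python
from collections import Counter

def clade(tree):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return clade(left) | clade(right)

def splits(tree):
    """Yield (parent clade, {child clade, child clade}) for each internal node."""
    if isinstance(tree, str):
        return
    left, right = tree
    yield clade(tree), frozenset([clade(left), clade(right)])
    yield from splits(left)
    yield from splits(right)

def ccp_tables(sample):
    """Count how often each clade and each split of a clade occur in the sample."""
    clade_counts, split_counts = Counter(), Counter()
    for tree in sample:
        for parent, children in splits(tree):
            clade_counts[parent] += 1
            split_counts[(parent, children)] += 1
    return clade_counts, split_counts

def ccp_probability(tree, clade_counts, split_counts):
    """Product over internal nodes of count(split) / count(parent clade)."""
    probability = 1.0
    for parent, children in splits(tree):
        seen = split_counts[(parent, children)]
        if seen == 0:  # split never observed: the tree gets probability zero
            return 0.0
        probability *= seen / clade_counts[parent]
    return probability

# Hypothetical toy sample: two rooted gene trees on six leaves.
s1 = ((("A", "B"), "C"), (("D", "E"), "F"))
s2 = (("A", ("B", "C")), ("D", ("E", "F")))
clade_counts, split_counts = ccp_tables([s1, s2])

# A tree that was never sampled but only combines observed splits.
unsampled = ((("A", "B"), "C"), ("D", ("E", "F")))
print(ccp_probability(s1, clade_counts, split_counts))
print(ccp_probability(unsampled, clade_counts, split_counts))
```

In this toy case the unsampled tree receives the same probability as the sampled one, because every split it contains was observed in some sampled tree. This is the property that lets a modest tree sample stand in for a much larger set of candidate gene trees.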

To perform the necessary calculations, we successfully applied for 34 million hours of computing time on the CURIE supercomputer under the seventh call of the Partnership for Advanced Computing in Europe (PRACE).

During the course of the fellowship, the fellow was able to use his experience in developing quantitative evolutionary models and performing Monte Carlo simulations in an entirely new setting. The fellow also acquired significant expertise in both the theory of phylogenetics and the practice of its application to large-scale datasets. This expertise has been instrumental in the fellow's successful application for funding to start his own junior research group.