CORDIS - EU research results

Accurate phylogenetic tree reconstruction using rate-heterogeneous birth-death models

Periodic Reporting for period 1 - PHYLOBD (Accurate phylogenetic tree reconstruction using rate-heterogeneous birth-death models)

Reporting period: 2021-11-01 to 2023-10-31

Phylogenies represent the evolutionary links between different individuals, populations or species and show when these lineages have diverged. The reconstruction of phylogenies, or phylogenetic inference, thus allows researchers to precisely date the emergence of key features in the history of life or the introduction of specific diseases into a population. However, the use of phylogenies is not limited to the information which is directly encoded into them, as they can also serve as data for a wide range of downstream analyses, for instance to study the evolution of phenotypic traits and their correlations with each other or with other factors such as environmental conditions. These analyses depend critically on the correctness of the phylogeny used as input data, so mistakes and inaccuracies in phylogenetic inference impact many areas of evolutionary biology beyond phylogenetics itself.

Phylogenetic inference is generally performed by using a molecular sequence alignment, which contains genetic information for all the samples, in combination with a substitution model, which describes the relative rates of mutation between different nucleotides, and a clock model, which describes the overall rate of mutation through time. Bayesian phylogenetic inference also adds a tree model, which represents the evolutionary dynamics of the lineages of the tree. This model ensures that the estimated phylogeny is consistent with the underlying evolutionary process, and allows the user to obtain estimates of important parameters, such as the speciation and extinction rates. In this project, I focus on two types of tree models, namely rate-heterogeneous models and models integrating fossils. Firstly, rate-heterogeneous tree models account for variations in diversification between different lineages of a phylogeny. These variations are frequent in empirical datasets, and can be caused by environmental changes, phenotypic differences or even external processes such as sampling biases. However, rate variations are seldom accounted for in the existing literature, which could bias phylogenetic inferences and our understanding of the driving forces behind diversification processes. Secondly, fossils are a critical tool to establish accurate ages for phylogenetic trees, and several tree models have been developed to allow fossil samples to be directly integrated into a phylogeny. These models do not currently account for variations between lineages in either the diversification or the fossilization process, although these variations are likely to be even more prevalent in the fossil record than in extant species.

This project aimed to expand the use of rate-heterogeneous tree models in Bayesian phylogenetic inference in three ways. My first goal was to establish the impact of integrating rate variations on simulated and empirical datasets, by comparing the results obtained using rate-heterogeneous and rate-homogeneous models. My second goal was to implement new post-processing and visualization tools adapted for rate-heterogeneous models, in order to make them more accessible for empirical users. Finally, my third goal was to integrate rate-heterogeneity into models designed for phylogenetic inferences using the fossil record, and to demonstrate the performance of these new models on simulated datasets.
First, I implemented a new rate-heterogeneous tree model for Bayesian phylogenetic inference. Unlike the existing multi-type models, the new ClaDS model represents progressive, small changes in speciation and extinction rates rather than sudden transitions. It is thus a much better model for many empirical scenarios, for instance when the variation in diversification rate is driven by gradual changes in environmental conditions. The new implementation was released as an open-source extension for the popular phylogenetic inference software BEAST2. An overview of the new package and its application to the cetacean phylogeny was published in Systematic Biology.
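To illustrate what "progressive, small changes" can look like, here is a minimal sketch of ClaDS-style rate inheritance, assuming the commonly described formulation in which each daughter lineage draws its speciation rate from a lognormal distribution centred on a scaled copy of the parent rate. The function names, the parameter names `alpha` and `sigma` and their default values are illustrative assumptions; this is not the code of the BEAST2 package, and extinction is left out for brevity.

```python
import math
import random


def daughter_rate(parent_rate, alpha, sigma, rng):
    """Draw a daughter speciation rate.

    Assumption: the log of the new rate is normally distributed around
    log(alpha * parent_rate) with standard deviation sigma.
    """
    return rng.lognormvariate(math.log(alpha * parent_rate), sigma)


def rates_along_lineage(root_rate, n_events, alpha=0.95, sigma=0.1, seed=1):
    """Follow one lineage through n_events speciation events.

    Returns the list of speciation rates, starting with the root rate.
    alpha < 1 gives a slow downward trend; sigma controls the noise.
    """
    rng = random.Random(seed)
    rates = [root_rate]
    for _ in range(n_events):
        rates.append(daughter_rate(rates[-1], alpha, sigma, rng))
    return rates


if __name__ == "__main__":
    # Each step changes the rate only slightly, in contrast with the
    # sudden jumps between rate classes of a multi-type model.
    for step, rate in enumerate(rates_along_lineage(1.0, 5)):
        print(step, round(rate, 3))
```

With a small `sigma`, successive rates stay close to each other, which is the gradual behaviour described above.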
In parallel, I used simulated datasets to compare the results of phylogenetic inference when using either a rate-homogeneous model or a multi-type model, which represents large variations in diversification rates. I showed that the choice of tree model has only a small impact on the inference of the phylogeny itself, but it affects the estimates of the diversification rates. In particular, using a rate-homogeneous model to estimate the phylogeny can lead to underestimating the uncertainty around estimates of the speciation and extinction rates. These results were presented at the congress of the European Society for Evolutionary Biology in August 2022.

I also developed a tool for summarizing the output phylogenies obtained from a Bayesian phylogenetic inference. This tool can import sets of phylogenies and combine them to more easily show clades which are strongly supported by the inference and clades which are more uncertain. It also offers several filtering options to summarize the main components of the estimated phylogeny. This tool is still in development and will be extended based on user feedback.
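The idea of separating strongly supported clades from uncertain ones can be sketched as a clade-frequency count over a set of sampled trees. This is only an illustration and not the tool developed in the project: the nested-tuple tree format and the function names are assumptions, and a real analysis would read trees from the inference output files.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set, list of clades) for a tree.

    Assumption: a tree is a nested tuple whose leaves are taxon names,
    e.g. (("A", "B"), "C"). A clade is the set of leaves below a node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def clade_support(trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter()
    for tree in trees:
        _, found = clades(tree)
        counts.update(set(found))
    return {clade: n / len(trees) for clade, n in counts.items()}


# Hypothetical sample of three trees on the taxa A, B, C and D.
sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
]
support = clade_support(sample)

if __name__ == "__main__":
    for clade, value in sorted(support.items(), key=lambda kv: -kv[1]):
        print(sorted(clade), round(value, 2))
```

In this toy sample the clade {A, B} appears in every tree, whereas {C, D} and {A, B, C} appear in only some of them, so a support threshold would keep the former and flag the latter as uncertain.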

For the final objective of the project, I built a new combined model which includes both diversification rate variations and information from the fossil record. The new model is based on the existing multi-type model and was released to users as part of the MSBD package for BEAST2. I showed on simulated datasets that integrating fossil information provides more accurate estimates of both the phylogeny and the diversification rates, compared to an inference based only on extant species. I also demonstrated that the new model is more accurate than the rate-homogeneous model for phylogenies with fossils, when the true underlying process is variable. These results were presented at the Mathematical and Computational Evolutionary Biology conference in June 2023, and an article is in preparation.
I also supervised a master's project by Nils Chabrol (ENS Lyon), who implemented a new model that combines fossil information with the ClaDS model in a similar way. The new model will be demonstrated by building a new phylogeny of extinct and extant crocodylians; this work is currently underway.
The new models built and implemented as part of this project expand the range of available diversification processes, allowing users to estimate phylogenies from new empirical datasets and test new evolutionary hypotheses. In addition, the simulation studies run as part of this project provide detailed methodological guidance to researchers interested in using Bayesian inference on empirical datasets, helping them design their analyses with the most up-to-date methods and obtain more precise estimates of phylogenies and evolutionary parameters.
Overall, the results of this project benefit all users of Bayesian phylogenetic inference, as well as studies which are not focused directly on phylogenetic inference but rely on a phylogeny as input data. Further work will build on the current results and expand our understanding of diversification processes in the tree of life.
Although the direct societal impacts of this project are limited, evolution is a cornerstone of our modern understanding of all areas of biology. Thus, developing more accurate and efficient tools to analyze evolutionary processes has potential impacts far beyond the phylogenetic community.
Summary of the two rate-heterogeneous models used in this project