Periodic Reporting for period 1 - PHYLOBD (Accurate phylogenetic tree reconstruction using rate-heterogeneous birth-death models)
Reporting period: 2021-11-01 to 2023-10-31
Phylogenetic inference is generally performed by using a molecular sequence alignment, which contains genetic information for all the samples, in combination with a substitution model, which describes the relative rates of mutation between different nucleotides, and a clock model, which describes the overall rate of mutation through time. Bayesian phylogenetic inference also adds a tree model, which represents the evolutionary dynamics of the lineages of the tree. This model ensures that the estimated phylogeny is consistent with the underlying evolutionary process, and allows the user to obtain estimates of important parameters, such as the speciation and extinction rates. In this project, I focus on two types of tree models, namely rate-heterogeneous models and models integrating fossils. Firstly, rate-heterogeneous tree models account for variations in diversification between different lineages of a phylogeny. These variations are frequent in empirical datasets, and can be caused by environmental changes, phenotypic differences or even external processes such as sampling biases. However, rate variations are seldom accounted for in the existing literature, which could bias phylogenetic inferences and our understanding of the driving forces behind diversification processes. Secondly, fossils are a critical tool to establish accurate ages for phylogenetic trees, and several tree models have been developed to allow fossil samples to be directly integrated into a phylogeny. These models do not currently account for variations between lineages in either the diversification or the fossilization process, although these variations are likely to be even more prevalent in the fossil record than in extant species.
This project aimed to expand the use of rate-heterogeneous tree models in Bayesian phylogenetic inference in three ways. My first goal was to establish the impact of integrating rate variations on simulated and empirical datasets, by comparing the results obtained using rate-heterogeneous and rate-homogeneous models. My second goal was to implement new post-processing and visualization tools adapted for rate-heterogeneous models, in order to make them more accessible to empirical users. Finally, my third goal was to integrate rate heterogeneity into models designed for phylogenetic inference using the fossil record, and to demonstrate the performance of these new models on simulated datasets.
In parallel, I used simulated datasets to compare the results of phylogenetic inference when using either a rate-homogeneous model or a multi-type model, which represents large variations in diversification rates. I showed that the choice of tree model has only a small impact on the inference of the phylogeny itself, but it affects the estimates of the diversification rates. In particular, using a rate-homogeneous model to estimate the phylogeny can lead to underestimating the uncertainty around estimates of the speciation and extinction rates. These results were presented at the congress of the European Society for Evolutionary Biology in August 2022.
I also developed a tool for summarizing the output phylogenies obtained from a Bayesian phylogenetic inference. This tool can import sets of phylogenies and combine them to more easily show clades which are strongly supported by the inference and clades which are more uncertain. It also offers several filtering options to summarize the main components of the estimated phylogeny. This tool is still in development and will be extended based on user feedback.
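As an illustration of the kind of summary described above, the following minimal sketch computes the support of each clade across a set of sampled phylogenies and filters out weakly supported clades. It is not the project's tool: the nested-tuple tree representation and the function names `clades`, `clade_support` and `filter_clades` are hypothetical choices made for this example, and trees are treated as rooted.

```python
from collections import Counter


def clades(tree):
    """Return the set of clades (as frozensets of tip names) in a rooted tree.

    Assumption for this sketch: a tree is a nested tuple whose leaves are
    tip-name strings, e.g. (("A", "B"), ("C", "D")).
    """
    found = set()

    def visit(node):
        if isinstance(node, str):
            # A tip contributes its own name but is not recorded as a clade.
            return frozenset([node])
        # An internal node defines the clade made of all tips below it.
        tips = frozenset().union(*(visit(child) for child in node))
        found.add(tips)
        return tips

    visit(tree)
    return found


def clade_support(trees):
    """Fraction of the sampled trees in which each clade appears."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree))
    return {clade: count / len(trees) for clade, count in counts.items()}


def filter_clades(support, threshold=0.5):
    """Keep only the clades whose support is at least `threshold`."""
    return {clade: s for clade, s in support.items() if s >= threshold}


# Toy stand-in for a posterior sample of four trees on tips A, B, C, D.
posterior_sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]

support = clade_support(posterior_sample)
for clade, value in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(clade), value)
```

In this toy sample the clade grouping A and B appears in three of the four trees, so it would be reported as well supported, whereas the clade grouping A and C appears only once and would be dropped by a threshold of 0.5. The root clade, which contains all tips, is trivially present in every tree.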
For the final objective of the project, I built a new combined model which includes both diversification rate variations and information from the fossil record. The new model is based on the existing multi-type model and was released to users as part of the MSBD package for BEAST2. I showed on simulated datasets that integrating fossil information provides more accurate estimates of both the phylogeny and the diversification rates, compared to an inference based only on extant species. I also demonstrated that the new model is more accurate than the rate-homogeneous model for phylogenies with fossils, when the true underlying process is variable. These results were presented at the Mathematical and Computational Evolutionary Biology conference in June 2023, and an article is in preparation.
I also supervised a master's project by Nils Chabrol (ENS Lyon), who implemented a new model that combines fossil information with the ClaDS model in a similar way. The new model will be demonstrated by building a new phylogeny of extinct and extant crocodylians, an analysis which is currently underway.
Overall, the results of this project benefit all users of Bayesian phylogenetic inference, as well as studies that do not focus directly on phylogenetic inference but rely on a phylogeny as input data. Further work will build on the current results and expand our understanding of diversification processes in the tree of life.
Although the direct societal impacts of this project are limited, evolution is a cornerstone of our modern understanding of all areas of biology. Thus, developing more accurate and efficient tools to analyze evolutionary processes has potential impacts far beyond the phylogenetic community.