Periodic Reporting for period 1 - PhyPPL (First use of probabilistic programming for hard problems in statistical phylogenetics)
Reporting period: 2020-05-04 to 2022-05-03
The Marie Curie Action 898120 (PhyPPL) aimed to push the boundaries of what is possible in computational phylogenetics by adapting the method of universal probabilistic programming to the field. Through several stages of development, including proof of concept, implementation of adapted software, and practical application of the software, the ESR was able to develop a novel and previously unattempted model (anagenetic diversification rate shifts) in under 1,000 lines of code. This is a significant achievement, as comparable implementations in legacy frameworks often require about ten times as much code.
The results of the study suggest that probabilistic programming is the best way to specify phylogenetic models, and the team's newly developed model provides a better explanation of the diversification history of birds. Specifically, the study found that many small shifts in the evolutionary rates along the branches of the evolutionary tree may explain diversification patterns better than burst-style changes in rates that had been hypothesized before. These findings have implications not only for the study of bird diversification but also for the broader field of evolutionary biology.
In the second stage of the project, we built on the lessons of the first stage and adapted universal probabilistic programming to our field. We created a novel compiler with a more efficient inference engine suited to the structure of the programs that we had analyzed earlier. This work resulted in the award-winning paper "Compiling Universal Probabilistic Programming Languages with Efficient Parallel Sequential Monte Carlo Inference", published at the ESOP 2022 conference ([https://link.springer.com/chapter/10.1007/978-3-030-99336-8_2](https://link.springer.com/chapter/10.1007/978-3-030-99336-8_2)). Another paper, co-authored by the ESR and dealing with improvements to the inference engine in the new compiler, is in the later stages of preparation.
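For readers unfamiliar with sequential Monte Carlo (SMC), the following is a minimal sketch of a bootstrap particle filter, the basic form of the inference family named in the paper title. It is not the compiler's implementation: the toy model (a latent Gaussian random walk observed with Gaussian noise), the function names, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    """Log-density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def bootstrap_smc(observations, n_particles=1000, init_sd=5.0,
                  step_sd=0.5, obs_sd=1.0, seed=0):
    """Bootstrap particle filter for a toy random-walk model (illustrative only).

    Returns the filtered posterior means and an estimate of the
    log marginal likelihood (log evidence).
    """
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, init_sd) for _ in range(n_particles)]
    log_evidence = 0.0
    filtered_means = []
    for y in observations:
        # 1. Propagate: each particle takes one step of the latent random walk.
        particles = [x + rng.gauss(0.0, step_sd) for x in particles]
        # 2. Weight: score each particle against the observation.
        #    Particles are independent here, so this step parallelizes well.
        log_w = [normal_logpdf(y, x, obs_sd) for x in particles]
        peak = max(log_w)
        w = [math.exp(lw - peak) for lw in log_w]  # stabilized weights
        total = sum(w)
        log_evidence += peak + math.log(total / n_particles)
        filtered_means.append(sum(wi * xi for wi, xi in zip(w, particles)) / total)
        # 3. Resample: duplicate likely particles, drop unlikely ones.
        particles = rng.choices(particles, weights=w, k=n_particles)
    return filtered_means, log_evidence


# Hypothetical data: twenty identical observations.
means, log_z = bootstrap_smc([5.0] * 20, n_particles=2000)
print(f"final filtered mean: {means[-1]:.2f}, log evidence: {log_z:.2f}")
```

The propagate and weight steps treat every particle independently, which is the property that makes SMC attractive for parallel execution; only resampling requires communication across particles.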
In the third phase, we found that generic probabilistic programming languages (PPLs) did not meet the needs of evolutionary biologists for a straightforward modeling framework. We therefore set out to create an ergonomic syntax for our efficient compiler, which we named TreePPL ([https://treeppl.org/](https://treeppl.org/)). The work was presented at a workshop at KTH, Sweden, on December 14, 2022 ([https://miking.org/workshop-2022](https://miking.org/workshop-2022)), and a release note paper is in preparation, of which the ESR will be the first author.
In the fourth phase of the project, we developed a novel macro-evolutionary model called anagenetic diversification rate-shifts under geometric Brownian motion (AnaDS-BDD). We implemented this model in the underlying framework of TreePPL and found that it outperforms its peers. Our model indicates that the processes of speciation and extinction are mostly influenced by very small phenotypic changes along the evolutionary lineages of birds, rather than by large splitting events as previously thought. We presented these results as a poster at the European Congress of Evolutionary Biology ESEB 2022 ([https://github.com/phyppl/eseb-2022-poster](https://github.com/phyppl/eseb-2022-poster)). An academic article, with the ESR as first author, is in the final stages of drafting and will be submitted for publication within a few months.
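To give an intuition for how a rate can drift through many small shifts along a lineage, here is a minimal sketch of geometric Brownian motion applied to a single diversification rate on one branch. It is not the AnaDS-BDD model itself, whose details are not given in this report: the function name, the discretization into equal time steps, and all parameter values are assumptions.

```python
import math
import random


def simulate_gbm_rate(initial_rate, drift, volatility, branch_length, n_steps, rng):
    """Simulate a strictly positive rate along one branch under
    geometric Brownian motion, using exact log-normal increments.

    All arguments are hypothetical illustration values, not estimates
    from the project.
    """
    dt = branch_length / n_steps
    rates = [initial_rate]
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        # The log of the rate performs ordinary Brownian motion with drift.
        log_increment = (drift - 0.5 * volatility ** 2) * dt \
            + volatility * math.sqrt(dt) * z
        rates.append(rates[-1] * math.exp(log_increment))
    return rates


rng = random.Random(42)
path = simulate_gbm_rate(initial_rate=0.2, drift=0.0, volatility=0.1,
                         branch_length=10.0, n_steps=100, rng=rng)
print(f"start={path[0]:.3f}, end={path[-1]:.3f}, "
      f"min={min(path):.3f}, max={max(path):.3f}")
```

Because each step multiplies the current rate by a positive factor, the rate can never become negative, and with a small volatility it changes gradually rather than in bursts.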
As an additional side output of the action during 2020, we used our tools to combat the Covid-19 pandemic. We developed an open-source tool called `prevestim`, which uses probabilistic programming to estimate the prevalence of Covid-19 cases ([https://phyppl.github.io/prevestim/](https://phyppl.github.io/prevestim/)). The tool was used to create the protocol for a "Seroepidemiologic study to detect antibody-mediated immunity against SARS-CoV-2 among residents and healthcare workers in the city of Plovdiv, Bulgaria, weeks 21-24, 2020." We presented this analysis as a contribution at ESCAIDE, the leading European epidemiological conference.
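As an illustration of the kind of problem such a tool addresses, the sketch below estimates true prevalence from an imperfect test with a simple grid approximation of the Bayesian posterior. It is not the `prevestim` implementation: the function name, the uniform prior, the grid method, and the example counts and test characteristics are all assumptions.

```python
import math


def prevalence_posterior(positives, tested, sensitivity, specificity, n_grid=2000):
    """Grid approximation of the posterior over true prevalence,
    assuming a uniform prior and a binomial model for test results.

    Returns the posterior mean and an approximate 95% credible interval.
    """
    # Cell midpoints avoid evaluating the likelihood exactly at 0 or 1.
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    log_post = []
    for prev in grid:
        # Probability that a random person tests positive:
        # true positives plus false positives.
        p_pos = prev * sensitivity + (1.0 - prev) * (1.0 - specificity)
        p_pos = min(max(p_pos, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        log_post.append(positives * math.log(p_pos)
                        + (tested - positives) * math.log(1.0 - p_pos))
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    weights = [w / total for w in weights]
    mean = sum(p * w for p, w in zip(grid, weights))
    # Read the 2.5% and 97.5% quantiles off the cumulative posterior.
    lower = upper = None
    cumulative = 0.0
    for p, w in zip(grid, weights):
        cumulative += w
        if lower is None and cumulative >= 0.025:
            lower = p
        if upper is None and cumulative >= 0.975:
            upper = p
    return mean, (lower, upper)


# Hypothetical survey: 50 positives among 1,000 people tested.
mean, (low, high) = prevalence_posterior(50, 1000, sensitivity=0.9, specificity=0.97)
print(f"posterior mean prevalence: {mean:.3f} (95% interval {low:.3f} to {high:.3f})")
```

With a specificity below 1, some of the observed positives are expected to be false positives, so the estimated true prevalence falls below the raw positive fraction.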
Exploitation and Dissemination of Results: The key outcome of this project is the development of TreePPL, a powerful tool that leverages the latest advancements in probabilistic machine learning to create a modeling framework tailored to the needs of evolutionary biologists. TreePPL has been used for scientific purposes during the project and continues to be used beyond its completion. The ESR uses the tool in his new role at the École normale supérieure in Paris, and it is also being shared with other researchers in Stockholm.
In addition to scientific papers and lectures, the ESR has also raised awareness of probabilistic modeling through a popular science YouTube channel. In an hour-long video interview ([https://www.youtube.com/watch?v=V0vAMUuR0H0&t=2036s](https://www.youtube.com/watch?v=V0vAMUuR0H0&t=2036s)), he explains the concepts of probabilistic modeling clearly, making them accessible to a wider audience.
As the project continues, we anticipate that TreePPL will be used to construct even more sophisticated and accurate evolutionary models. We also expect the development of new applications of the method, beyond phylogenetics, to other fields in biology, ecology, and epidemiology. The impact of TreePPL may extend beyond the academic community, with potential implications for industry and public policy.
Furthermore, the PhyPPL project has wider societal implications. The project's approach to modeling evolution is grounded in probability theory and machine learning, which are becoming increasingly relevant in today's data-driven society. The project's methods and tools may have implications for fields such as finance, engineering, and artificial intelligence, as well as for public policy and decision-making.