Periodic Reporting for period 1 - EvolInfome (The sources of historical signal in the genomes of birds)
Reporting period: 2021-09-01 to 2023-08-31
Among the many approaches that have been proposed for testing our ability to extract historical signals from genomes, the use of simulations has proven particularly promising due to its resemblance to an experimental setting. In addition, methods in machine learning, and in particular methods of unsupervised learning, provide the opportunity to extract the dominant signals in a broad range of data types. This project aimed to perform a detailed simulation study with molecular data evolving under a broad range of conditions, and across large numbers of genes, to examine the possible behaviour of genomic data sets under various statistical analysis frameworks (work package 1). In the second instance, the project aimed to build a software package that was easily accessible to researchers in biology, using methods of unsupervised learning as well as incorporating classical statistical tests for finding the dominant signals of evolutionary rates in genomic data sets (work package 2). Using this novel framework of analysis, an additional set of simulations aimed to demonstrate the limitations of the proposed methods, as well as their power and usefulness for the analysis of data of different sizes and across the diversity of evolutionary scenarios (work package 3).
An important objective of the project was to join forces with the Bird 10,000 Genomes consortium and to assess its data efficiently under the framework described above (work package 4). The framework was used for identifying lineages of birds with unusually fast or slow evolutionary processes, as well as the genes that have been most consequential for their evolutionary success. Overall, the project led to methodological advances in the analysis of molecular genomic data, as well as biological insights within one of the major genome sequencing consortia being led at the host institution. An additional outcome of the project is a long-term collaboration between the hosts and the recipient on the development of novel methodological approaches and the efficient usage of ever-increasing biological data resources. Briefly, the hosts provided world-leading knowledge on large molecular genomic data resources, while the recipient provided expertise on the development and analysis of statistical methods.
An extensive set of analyses has been performed in collaboration with the team of the Bird 10,000 Genomes consortium. These analyses, carried out using the frameworks developed in this project, have shed light on the limitations of these data and on the most promising parts of genomes for developing an advanced understanding of the evolution of birds. Specifically, the parts of genomes that are subject to strong selective constraints are those that code for protein products, and they are also the ones with the greatest heterogeneity in evolutionary processes. Due to this heterogeneity, these regions are also the ones that have the poorest signals and the greatest chances of leading to bias when extracting historical inferences. The best types of data were those known as intergenic regions, which occur in between coding regions and are likely to have undergone the least heterogeneity in evolutionary processes. This is likely to make them the simplest to model using computationally efficient approaches, such as those traditionally implemented in evolutionary analysis.
Another type of insight from the bird data was related specifically to the evolutionary rates across taxa and across gene regions. Using the novel methods developed and complete genomes from these bird species, this project undertook the largest-scale analysis of evolutionary rates ever performed. The results have shown that particular lineages stood out for having distinct evolutionary processes. This was the case for some of the very earliest lineages of birds, which showed dramatic changes in some of the most fundamental molecular machineries. These include the machineries that allow for the reading of DNA and for the replication of DNA, as well as those for improved cardiac function that might be related to the ability to fly.
The wide-ranging methodological advances and biological insights gained from this project are being disseminated in the form of peer-reviewed publications, outreach in local museums, and teaching at the tertiary level.
In the case of bird evolution, the current project has made a leap both in the methodological approaches used and in the insights about bird biology that can be extracted from genomic resources. Future research will benefit from examining how individual taxonomic sub-groupings have developed unique genomic adaptations and changes. Crucially, this will be possible through the training and collaborations developed in this project. Additional research on the power of unsupervised machine learning methods will also provide fruitful advances that build on the methods developed in this project. While powerful, the methods developed here require substantial computational resources, such that future efforts might focus on greater scalability for the analysis of even larger data sets with thousands of whole genomes.