CORDIS - EU research results

The sources of historical signal in the genomes of birds

Periodic Reporting for period 1 - EvolInfome (The sources of historical signal in the genomes of birds)

Reporting period: 2021-09-01 to 2023-08-31

Genomes contain information about the past, but extracting the historical signal from large numbers of genomes remains a major challenge in the biological sciences. In particular, a problem arises when different species or taxa evolve at distinct rates and, in addition, their genes carry heterogeneous signals due to the biological processes of selection and mutation. Yet an accurate reconstruction of historical processes using genomes is bound to bring substantial benefits to the biological sciences.

Among the many approaches that have been proposed for testing our ability to extract historical signals from genomes, the use of simulations has proven particularly promising due to its resemblance to an experimental setting. In addition, methods in machine learning, and in particular methods of unsupervised learning, provide the opportunity to extract the dominant signals in a broad range of data types. This project aimed to perform a detailed simulation study with molecular data evolving under a broad range of conditions, and across large numbers of genes, to examine the possible behaviour of genomic data sets under various statistical analysis frameworks (work package 1). In the second instance, the project aimed to build a software package that is easily accessible to researchers in biology, using methods of unsupervised learning and incorporating classical statistical tests for finding the dominant signals of evolutionary rates in genomic data sets (work package 2). Using this novel framework of analysis, an additional set of simulations was to demonstrate the limitations of the proposed methods, as well as their power and usefulness for the analysis of data of different sizes and across the diversity of evolutionary scenarios (work package 3).

An important objective of the project was to join forces with the Bird 10,000 Genomes consortium, assessing their data efficiently under the framework described above (work package 4). The framework was used to identify lineages of birds with unusually fast or slow evolutionary processes, as well as the genes that have been most consequential for their evolutionary success. Overall, the project led to methodological advances in the analysis of molecular genomic data, as well as biological insights within one of the major genome sequencing consortia being led at the host institution. An additional outcome of the project is a long-term collaboration between the hosts and the recipient on the development of novel methodological approaches and the efficient use of ever-increasing biological data resources. Briefly, the hosts provided world-leading knowledge on large molecular genomic data resources, while the recipient provided expertise on the development and analysis of statistical methods.
All of the objectives of the project have been successfully completed, and the results are now being prepared for publication. The multiple large-scale simulation studies are available online in the form of preprints and in data repositories; this is the case for the simulations under various molecular rate processes and under various models of evolution proposed in work packages 1 and 3. A novel framework for the analysis of evolutionary rates data has so far been released as a preprint and is undergoing peer review for publication. The main results describe the amounts of excessively high or low heterogeneity in genomic evolutionary rates that lead to biases in our understanding of evolutionary processes. These insights were used to develop a novel method for evaluating genomes that is highly robust in providing insights on evolutionary rates across lineages and genes, and that is packaged as software with a detailed tutorial for usage.

An extensive set of analyses has been performed in collaboration with the team of the Bird 10,000 Genomes consortium. These analyses, carried out under the frameworks I developed, have shed light on the limitations of these data and on the most promising parts of genomes for developing an advanced understanding of the evolution of birds. Specifically, the parts of genomes that are subject to strong selective constraints are those that code for protein products, and they are also the ones with the greatest heterogeneity in evolutionary processes. Due to this heterogeneity, these regions are also the ones that have the poorest signals and the greatest chances of leading to bias when making historical inferences. The best types of data were those known as intergenic regions, which occur in between coding regions and are likely to have undergone the least heterogeneity in evolutionary processes. This is likely to make them the simplest to model using the computationally efficient approaches that are traditionally implemented in evolutionary analysis.

Another type of insight from the bird data related specifically to the evolutionary rates across taxa and across gene regions. Using the newly developed methods and complete genomes from these bird species, this project undertook the largest-scale analysis of evolutionary rates ever performed. The results have shown that particular lineages stood out for having distinct evolutionary processes. This was the case for some of the very earliest lineages of birds, which had dramatic changes in some of the most fundamental molecular machineries: those that allow for the reading of DNA, those for the replication of DNA, and those for improved cardiac function, which might be related to the ability to fly.

The wide-ranging methodological advances and biological insights gained from this project are being disseminated in the form of peer-reviewed publications, outreach in local museums, and teaching at the tertiary level.
This project represented a shift in the traditional methods of analysis of evolutionary rates in the biological sciences. The methods developed allow for computationally efficient yet detailed analysis of evolutionary rates from large genomic resources. Specifically, the methods developed will help identify the key genetic regions that have led to the biological diversity that we observe today. This benefits our understanding of biodiversity by revealing the key genomic changes that have allowed flora, fauna, and microbes to be successful across evolutionary time, and that have allowed them to produce distinct features and traits that are biologically novel.

In the case of bird evolution, the current project has made a leap both in the methodological approaches used and in the insights about bird biology that can be extracted from genomic resources. Future research will benefit from examining how individual taxonomic sub-groupings have developed unique genomic adaptations and changes. Crucially, this will be possible through the training and collaborations developed in this project. Additional research on the power of unsupervised machine learning methods will also provide fruitful advances that build on the methods developed in this project. While powerful, the methods developed here require substantial computational resources, so future efforts might focus on greater scalability for the analysis of even larger data sets with thousands of whole genomes.
Summary of project aims and outcomes.