Biologically-motivated probabilistic evolutionary models and their use for genomic analyses

Final Report Summary - REALISTIC CODON MODE (Biologically-motivated probabilistic evolutionary models and their use for genomic analyses)

In the post-genomic era, probabilistic molecular-evolutionary models are the driving force behind meaningful analyses of the vast amounts of data generated by high-throughput sequencing technologies. While such models exist for various types of genomic elements, there is a pressing need for advanced models that capture evolutionary dynamics across multiple levels of genomic resolution and biological organization: from the level of mutation, via constraints at the DNA, RNA, and protein levels, and ultimately in relation to the whole organism. In this proposal we aimed to address this need. The objectives of the proposal were:
(1) Develop a set of multi-layer evolutionary models that relax the widely held assumption that the synonymous substitution rate is homogeneous across all sequence sites, thereby allowing selective constraints acting at the nucleotide level to be detected.
(2) Establish a framework for the combined analysis of phenotype-genotype evolution, thereby allowing adaptive genomic changes to be associated with species' traits.
(3) Apply the developed methods to biological data to gain novel biological knowledge in specific systems.
During the course of the project, we successfully tackled all three objectives. First, in accordance with objective (1), we developed a codon evolutionary model that allows for among-site variability of both synonymous and nonsynonymous substitution rates. This newly developed model captures selective constraints acting at both the protein level and the nucleotide level, and further provides a likelihood framework for inferring positive selection against a background of variability in the baseline DNA/RNA substitution rate. Using this methodology, we showed that variability of the baseline DNA/RNA substitution rate is a widespread phenomenon in coding sequence data of vertebrate genomes, most likely reflecting varying degrees of selection at the nucleotide level. Additionally, we showed that ignoring this variability results in a considerable number of erroneous inferences of positive selection.
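The effect of ignoring synonymous-rate variability can be illustrated with a toy calculation. The sketch below is not the codon model developed in the project. It assumes hypothetical per-site rates (`ks`, `ka`) and compares the per-site Ka/Ks ratio with the ratio obtained when every site is assigned the gene-wide mean synonymous rate, which is one simple way to picture the homogeneity assumption.

```python
# Toy illustration with hypothetical numbers (not project data).
from statistics import mean

# Hypothetical per-site synonymous (Ks) and nonsynonymous (Ka) rates.
ks = [0.5, 1.0, 2.0, 0.5]
ka = [0.25, 0.5, 1.5, 0.1]


def omega_variable_ks(ka, ks):
    """Per-site Ka/Ks when Ks is allowed to vary among sites."""
    return [a / s for a, s in zip(ka, ks)]


def omega_constant_ks(ka, ks):
    """Per-site Ka/Ks when every site is assigned the mean Ks."""
    ks_mean = mean(ks)
    return [a / ks_mean for a in ka]


def sites_above_one(omegas):
    """Indices of sites whose ratio exceeds 1 (candidate positive selection)."""
    return [i for i, w in enumerate(omegas) if w > 1.0]


if __name__ == "__main__":
    print("variable Ks:", sites_above_one(omega_variable_ks(ka, ks)))
    print("constant Ks:", sites_above_one(omega_constant_ks(ka, ks)))
```

In this made-up example the third site has an elevated synonymous rate. Under the constant-Ks assumption its ratio exceeds 1 and it would be flagged, whereas the site-specific ratio stays below 1.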
The developed model was applied to study the genome evolution of Human Immunodeficiency Virus type 1 (HIV-1) (objective 3). To this end, we constructed a genome-wide map of the selective forces operating against the fixation of synonymous substitutions across the HIV-1 genome. We detected twenty-one short linear stretches within the genome that display a significantly low rate of synonymous substitutions. The majority of the identified regions mapped to previously known regulatory functional regions or to overlapping open reading frames, validating our methodology as a general tool for predicting functional regions at the nucleotide level nested within protein-coding regions. Furthermore, we detected a number of Ks-conserved regions that could not be assigned to a known function and are thus predicted to be under selection for as yet unknown reasons. Mutagenesis experiments to reveal the function of one of these regions, located within the pol gene, were conducted but showed no clear effect on viral replication in cell culture.
In accordance with objective (2), we developed a novel evolutionary model that allows the detection of specific sequence sites most likely associated with a phenotypic trait of interest. To this end, we developed a mixture model that assumes the existence of two distinct site categories: one that is influenced by the trait and one that is not. We also formulated a null model, thereby permitting explicit testing of the null hypothesis that the analyzed trait is not associated with the rate of evolution at any sequence site. When the null model is rejected, the sites associated with the analyzed trait can be inferred using empirical Bayes techniques.
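The two-category mixture and the empirical Bayes step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the per-site likelihoods `lik_trait` and `lik_no_trait` and the mixing proportion `prop_trait` are hypothetical inputs that a real analysis would obtain from the phylogenetic likelihood computation and from maximum-likelihood estimation.

```python
import math


def mixture_log_likelihood(lik_trait, lik_no_trait, prop_trait):
    """Log-likelihood of all sites under a two-category mixture.

    Each site's likelihood is a weighted sum over the two categories:
    trait-influenced (weight prop_trait) and not influenced.
    """
    return sum(
        math.log(prop_trait * lt + (1.0 - prop_trait) * ln)
        for lt, ln in zip(lik_trait, lik_no_trait)
    )


def posterior_trait_influenced(lik_trait, lik_no_trait, prop_trait):
    """Empirical Bayes posterior probability, per site, of belonging
    to the trait-influenced category (Bayes' rule with the estimated
    mixing proportion plugged in as the prior)."""
    posteriors = []
    for lt, ln in zip(lik_trait, lik_no_trait):
        numerator = prop_trait * lt
        posteriors.append(numerator / (numerator + (1.0 - prop_trait) * ln))
    return posteriors


if __name__ == "__main__":
    # Hypothetical per-site likelihoods for three sites.
    lik_trait = [0.02, 0.001, 0.01]
    lik_no_trait = [0.002, 0.01, 0.01]
    for i, post in enumerate(
        posterior_trait_influenced(lik_trait, lik_no_trait, prop_trait=0.2)
    ):
        print(f"site {i}: posterior = {post:.3f}")
```

In this sketch, setting `prop_trait` to 0 reduces the mixture to a single category in which no site is influenced by the trait. The null model used in the project may be parameterized differently.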
Using simulations, we investigated the statistical power and accuracy of the new method. Our results indicated that accuracy increases with larger trees and with a greater proportion of sites affected by the trait. Our simulations also indicated that the power of the method to reject the null hypothesis, as assessed using the likelihood ratio test, is high and increases with higher values of p, with an acceptable type I error rate around the expected 5%.
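A likelihood ratio test of this kind can be sketched in a few lines. The example below is an illustration only: it assumes the test statistic is compared to a chi-square distribution with one degree of freedom, which the report does not state. For mixture models in which the null hypothesis lies on the boundary of the parameter space, a different reference distribution or a simulation-based calibration may be more appropriate. The log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt, alpha=0.05):
    """Compare nested models with a likelihood ratio test.

    Assumes (for illustration) a chi-square reference distribution
    with one degree of freedom, whose survival function is
    erfc(sqrt(x / 2)).
    Returns (statistic, p_value, reject_null).
    """
    # The statistic cannot be negative for nested models; clamp
    # small negative values caused by numerical optimization error.
    statistic = max(2.0 * (loglik_alt - loglik_null), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical maximized log-likelihoods of the null and mixture models.
    stat, p_value, reject = likelihood_ratio_test(-1000.0, -997.0)
    print(f"statistic = {stat:.2f}, p-value = {p_value:.4f}, reject = {reject}")
```

Applying such a test to many data sets simulated under the null model and counting the fraction of rejections gives an empirical estimate of the type I error rate.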
During the project period, I established an independent research lab at the Department of Molecular Biology and Ecology of Plants, Tel Aviv University. During this period, twelve undergraduate students conducted research projects related to the proposed research. In addition, five graduate students and three post-doctoral fellows conducted various research projects related to computational biology and evolutionary research. Studies stemming from this proposal have been presented at several international conferences and a dozen departmental seminars.