Evolutionary genomics: new perspectives and novel medical applications

Periodic Reporting for period 3 - EvoGenMed (Evolutionary genomics: new perspectives and novel medical applications)

Reporting period: 2019-01-01 to 2020-06-30

Why is our genome the way it is? Why, for example, is it so very large? In understanding the answers to questions like these we hope to understand which parts of our genome are functional and why. Knowing which parts are functional could in turn lead to improved diagnostics and to improved gene-based therapies.

We are particularly interested in a core idea in evolutionary genomics, namely that selection should be less efficient when populations are small. This has been hypothesised to explain why our genome is so large: selection is too weak in large-bodied organisms to prevent the spread, by chance, of insertions that are just a little bit bad for us. We want to see if this idea can be extended: if selection is weak and leads to a bloated genome, might our genome also be prone to errors and, if so, does this mean that selection in us commonly acts on error mitigation devices?

One result of such selection to mitigate errors could be an increased role for what have been thought to be largely irrelevant parts of our genome. We focus on so-called silent sites - silent because it is thought that mutations at these sites have no impact on us. We have however shown that there is selection on such sites/mutations. Why is this? In understanding this can we make better new genes to help disease-bearing patients and can we improve diagnosis?

The objectives of the project are thus:
- to examine the role of error in evolution, both as a cause of selection to prevent it and as a means to the evolution of novelty.
- to go from understanding the relationship between error prevention and selection on synonymous sites to improving both diagnostics and our understanding of the etiology of disease.
- to go from understanding errors and innocuous mutations to improving therapeutics, both by improving new genes and by defining sites in the genome where these new genes are less likely to cause knock-on errors by affecting the expression of neighbours.

This work is of societal relevance not just because it has the potential to impact on medicine directly, but because we are also asking fundamental questions about how evolution works and, philosophically, what it is to be human. Are we a perfect genetic machine or a barely adequate, error-prone product of inefficient selection?
This project aims to better define the roles of genetic errors to address fundamental questions in evolution: a) to determine the nature of error-proofing devices (WP1), b) to determine the commonality of such devices, most especially whether they are more common when population size is low (and possibly hence error rates are high) (WP1), c) to appraise the role of errors in the generation of novelty (WP4). In addition, we aim to apply this knowledge d) to make better transgenes (WP3), e) to improve diagnostics (WP2) and f) to define safe harbour zones for transgene insertion (WP4).

A focus of our interest has been exonic splice enhancers (ESEs), these being exonic motifs that act to reduce the rate of missplicing. Understanding their distribution within and between genomes is central to aims a and b (WP1). Using this information is central to aims d and e (WP3 and WP2 respectively). If selection is acting to preserve ESEs (and intraRNA protein binding sites more generally) we expect signatures of this in SNP profiles and in interspecific conservation profiles (WP1 underpinning WP2). Having highlighted the apparent disconnect between estimates of the impact of synonymous mutations on splice disruption derived from evolutionary and experimental approaches [1], we have subsequently made a large step forward in squaring this circle, having determined for the first time both the commonality and strength of selection on mutations that disrupt error-controlling ESEs [2] (WP1, underpinning WP2). This established common and strong selection, consistent both with the experimental data and with an important role of missplicing in human disease (WP2). Consistent with this we estimated that 25-45% of all diseases are associated with missplicing [3] (WP1 and underpinning WP2). We have, in addition, applied a simple ESE disruption metric to establish whether disease-associated mutations might act via splicing disruption [4] (WP2) and analysed the extent to which mutations in tumours disrupt splicing [5] (WP1 underpinning WP2). In addition, we have shown that selection is not simply to preserve motifs: there is also selection to avoid inappropriate binding of RNA-binding proteins, indicative of selection to prevent errors [6] (WP1, underpinning WP2).
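As an illustration only, an ESE disruption metric of the kind mentioned above might be sketched as the net number of ESE motif matches lost when a single nucleotide is substituted. The report does not specify the metric used in [4]; the function names, the 6-mer motif length and the one-motif set below are hypothetical placeholders, not a curated ESE set.

```python
def count_motif_hits(seq: str, motifs: set, k: int = 6) -> int:
    """Count (possibly overlapping) k-mers in seq that belong to the motif set."""
    return sum(seq[i:i + k] in motifs for i in range(len(seq) - k + 1))


def ese_disruption(seq: str, pos: int, alt: str, motifs: set, k: int = 6) -> int:
    """Net number of motif matches lost when seq[pos] (0-based) is replaced by alt.

    Positive values mean the substitution destroys more motif matches than it
    creates. Only the window of k-mers overlapping the mutated site is scanned.
    """
    seq = seq.upper()
    mutated = seq[:pos] + alt.upper() + seq[pos + 1:]
    lo = max(0, pos - k + 1)       # first base of the leftmost overlapping k-mer
    hi = min(len(seq), pos + k)    # one past the last base of the rightmost k-mer
    return count_motif_hits(seq[lo:hi], motifs, k) - count_motif_hits(mutated[lo:hi], motifs, k)


# Hypothetical motif set used purely for demonstration.
toy_motifs = {"GAAGAA"}
exon = "CCGAAGAACC"
inside = ese_disruption(exon, 4, "T", toy_motifs)    # substitution inside the motif
outside = ese_disruption(exon, 9, "G", toy_motifs)   # substitution outside the motif
```

In practice the motif set would come from an experimentally or computationally derived ESE collection, and the score would be compared between disease-associated and control mutations.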

Beyond the employment of the information for diagnostics (WP2), we have been active in converting the information to make better transgenes for gene therapy (aim d, WP3). First, by quantitative analysis of the impact of exonic splice enhancers (ESEs) on rates of synonymous site evolution in intronless genes, we have defined ESEs that will be needed in intronless transgenes [7], for reasons other than splice modification. We have developed a website (Enhance transgenes, not yet public) that enables the user to upload a gene sequence, which is then converted to an optimized transgene. The approach is to mimic human intronless genes in their site-specific GC content and to allow the user to select several options, crucially whether to specifically ablate ESEs. A first approach has been trialed and implemented, giving better results than a commercial alternative [8]. The website development is complete (at least in first iteration) and full-scale experimental benchmarking is underway.

While application of insights is important, a key novelty of the program was to ask what genomic features might be adaptations to mitigate errors (WP1). We found, for example, that over-use of the nucleotide A at CDS fourth sites is best understood as a trap for error-prone transcription initiation as it permits immediate ribosomal rescue (NTGA becomes TGA, a stop) [9].
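The fourth-site logic can be made concrete with a minimal sketch: if a CDS begins NTG and is followed by A, a reading frame shifted by +1 starts with TGA, a stop codon. The function names below are illustrative assumptions, not taken from [9], and the standard stop codons TAA, TAG and TGA are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def immediate_plus_one_stop(cds: str) -> bool:
    """True if reading from the second base of the CDS hits a stop codon at once."""
    cds = cds.upper()
    return len(cds) >= 4 and cds[1:4] in STOP_CODONS


def fourth_site_a_fraction(cds_list) -> float:
    """Fraction of sequences (with at least 4 bases) that carry A at the fourth site."""
    usable = [c.upper() for c in cds_list if len(c) >= 4]
    if not usable:
        return 0.0
    return sum(c[3] == "A" for c in usable) / len(usable)


# Toy sequences for demonstration only.
with_a = immediate_plus_one_stop("ATGACC")     # +1 frame begins TGA
without_a = immediate_plus_one_stop("ATGGCC")  # +1 frame begins TGG
frac = fourth_site_a_fraction(["ATGACC", "ATGGCC"])
```

Comparing `fourth_site_a_fraction` across genomes against a base-composition expectation would be one simple way to ask whether A is over-used at this site.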

The greater theoretical novelty is the notion that error control is more important when the effective population size (Ne) is low, as low Ne gives higher error rates. This means, unusually, stronger selection when Ne is low, the opposite of the classical prediction from the nearly neutral model (WP1). We examined the role of errors in gene evolution as a function of effective population size by: a) considering the degree of ESE usage as a function of effective population size, intron size and density, and splice site usage [10], b) considering whether anti-frameshift adaptations are handled differently when Ne is small and when it is large (WP1) and c) examining the hypothesis that biased gene conversion corrects genetic errors (mutations) by enforcing a bias (AT->GC) opposite to the mutation bias (GC->AT).
Our three tests all support the thesis that errors are more common when Ne is low but that in turn selection for control of errors and of the downstream consequences of errors is stronger. To consider gene conversion’s bias we provided a broad-scale test of the hypothesis by close analysis of meiotic conversion in tetrads in four species. The hypothesis in question supposes that if mutation (heritable error) is biased in the GC->AT direction (as our [13-15] and other data suggest to be all but universal) gene conversion should counterbalance this and be biased AT->GC. Our expectation is that when Ne is low, mutation rates will be high (selection being too weak) but that the conversion bias will as a consequence be strong. Indeed, our data failed to support the counterbalance model [11] in species with high Ne and low mutation rates (e.g. yeasts), but the model was found to work in species with low Ne and high (mutation) error rates, such as humans. This analysis thus suggests that low Ne is indeed associated with more errors and more selection for control of the consequences of errors.
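A conversion bias of the kind tested above can be summarised, in a minimal sketch, as the fraction of informative conversion events that go in the AT->GC direction, with an exact two-sided binomial test against the unbiased expectation of 0.5. This is an illustrative assumption about how such counts might be analysed, not the procedure used in [11]; the function name and the counts are hypothetical.

```python
from math import comb


def conversion_bias(n_at_to_gc: int, n_gc_to_at: int):
    """Return (fraction of AT->GC conversions, two-sided exact binomial p-value).

    The null hypothesis is no bias: each informative conversion event is
    equally likely to go AT->GC or GC->AT.
    """
    n = n_at_to_gc + n_gc_to_at
    if n == 0:
        raise ValueError("no informative conversion events")
    fraction = n_at_to_gc / n
    # Under p = 0.5 the distribution is symmetric, so the two-sided p-value
    # is twice the tail at or beyond the larger count, capped at 1.
    k_extreme = max(n_at_to_gc, n_gc_to_at)
    tail = sum(comb(n, k) for k in range(k_extreme, n + 1)) / 2 ** n
    return fraction, min(1.0, 2 * tail)


# Hypothetical counts for demonstration only.
unbiased = conversion_bias(5, 5)
biased = conversion_bias(10, 0)
```

A fraction well above 0.5 with a small p-value would be consistent with AT->GC biased conversion; a fraction near 0.5 would not.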
Our evidence from out-of-frame stop codons suggests that errors are mitigated when Ne is low, while at high Ne selection is strong enough to prevent errors [12]. Similarly, ESE density is highest in species with many and large introns, which we showed is also associated with low Ne, providing strong evidence that the relationship between Ne and strength of selection isn’t necessarily as always assumed [10]. This same analysis indicated that within genomes selection for error control is highest when introns are large and splice sites weak (WP1 and WP2). In addition to the above, we have been estimating other key error rates, including the most important, namely the mutation rate [13-15], alongside recombination rates [13, 14], recombination being thought to be an adaptation to enable purging of genomic errors (i.e. mutations). This will fit into a larger meta-analysis of the role of Ne in enabling high error rates (as predicted by the drift-barrier model) and thus in turn higher recombination rates. In this manner the evolution of sex debate can be unified with the novel framework tested here (WP1).
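For the out-of-frame stop codon evidence discussed above, a minimal sketch of the underlying quantity is the density of stop codons in the +1 and +2 reading frames of a coding sequence, since such stops would terminate translation soon after a frameshift. The function name and toy sequence are assumptions for illustration; the analyses in [12] are more involved and are not reproduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def out_of_frame_stop_density(cds: str, frame: int) -> float:
    """Fraction of complete codons in the shifted frame (1 or 2) that are stops."""
    if frame not in (1, 2):
        raise ValueError("frame must be 1 or 2")
    cds = cds.upper()
    # Step through the sequence in triplets starting at the shifted position,
    # keeping only complete codons.
    codons = [cds[i:i + 3] for i in range(frame, len(cds) - 2, 3)]
    if not codons:
        return 0.0
    return sum(c in STOP_CODONS for c in codons) / len(codons)


# Toy sequence for demonstration only.
toy_cds = "ATGATAGCC"
plus_one = out_of_frame_stop_density(toy_cds, 1)  # codons TGA, TAG
plus_two = out_of_frame_stop_density(toy_cds, 2)  # codons GAT, AGC
```

Any real comparison would need a null that controls for nucleotide composition, since GC-poor sequences contain more stop-like triplets by chance.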
Errors are, almost by definition, chance events that are not pre-programmed (and hence not the result of selection). But accidents can also be the source of novelty. If this is right then we might expect novelty to emerge from situations where selection against the errors is weak and so the errors are tolerated in the first instance (WP4). We addressed this by considering the problem of fast evolution of duplicate genes to determine the extent to which errors, such as gene duplication events, might be tolerated owing to minimal fitness effects of duplications of some genes. Strikingly, we found that fast evolution of duplicate genes is for the very opposite reason usually supposed: far from being evidence of duplicates being important, the evidence says “dull” genes that were always fast evolving are more prone to successful duplication as alterations in their dosage are of low impact [16].

This result suggests that novelty is produced when errors are initially of low impact, enabling the error to go from tolerated to useful. This concords with our analyses of gene expression evolution. We have examined the role of errors in gene expression by considering the extent to which the evolution of the expression of one gene predicts the evolution of the expression of neighbours (WP4). We have implemented our phylogenetically explicit framework and shown in humans that piggy backing is the default mode of gene expression change over evolutionary time [17]. In particular we have focused on how this varies with gene density within and across species: as our model predicts, when gene density goes up the domain size of piggy backing goes down, otherwise the errors in gene expression would be too catastrophic. This same result is key to understanding domains in our genome that are “insulated” from the expression of neighbours (WP4). We are using both comparative expression data and experimental data to define features of genes (and genomic domains) that lead to them being insulated, both in the immediate term and over evolutionary time, from their neighbours (WP4). We have initiated an experimental study of massively parallel transgene insertion in human iPS cells to determine the predictors of non-independent gene expression. This we will compare with trans-primate iPS RNASeq data to determine whether evolutionary insulation and functional insulation are coupled.

Along with duplicates, piggy backing effects are, we suggest, a major source of novelty in gene expression as well. To address this we are employing single cell and pan primate iPS data to examine early human gene expression, concentrating on the activity of HERVH and novel transcripts associated with its presence and expression (WP4). HERVH involvement is especially elegant as it provides in effect a massive natural transgene experiment in which the same element inserts throughout the genome, potentially modulating neighbour gene expression. We have found that consistently in primates the pluripotency transcriptomes have been modified by HERVH adjusting expression of the multiple neighbours. Much of this is likely functionless and tolerated, but occasionally a novel transcript gets incorporated in the functional components of the pluripotency pathway (e.g. HERVH derived ESRG and lincROR). This work on new gene creation led us to provide the logical basis for, and clarify pitfalls in, analysis of de novo genes [18]. Strikingly, we have shown how analysis of new genes can enable extraction of human naïve stem cells [19].
Novelty need not simply mean novel and beneficial. Why, for example, are there non-transmissible diseases that are species specific? We have analysed one such. Preeclampsia is a disease of pregnancy that affects ~5% of mothers and remains a leading cause of maternal and fetal mortality. Curiously, the disease is human specific. Why is this and might error be associated with its etiology? Examining the transcriptomes of numerous preeclamptic placentas we tested the hypothesis that errors in the regulation of imprinted genes might be the cause (imprinted genes are dose sensitive but prone to error as only one copy of the gene is expressed, leading to easy over or under expression by epigenetic errors). We found there to be validity to this hypothesis and highlighted the role of an imprinted gene (DLX5) that is expressed in human placenta but not in other mammals. Thus a combination of recruitment of genes to the human placenta and expression errors seems to in part underpin the evolution of disease novelty [20].

Just as our fundamental work led to the chance discovery of naïve stem cells, so too by chance we stumbled across a remarkable example of a genomic error: we discovered the first organism with two tRNAs for the same codon, leading to errors when this codon is used. We showed that related species resolve the problem by loss of one of the two, but the focal species seems instead to tolerate it and employ the codon rarely, not in highly expressed genes and not at key protein locations [21].


1. Savisaar, R. and L.D. Hurst, Estimating the prevalence of functional exonic splice regulatory information. Human Genetics, 2017: p. 1-20.
2. Savisaar, R. and L.D. Hurst, Exonic splice regulation imposes strong selection at synonymous sites. Genome Research, 2018: p. (in press).
3. Wu, X. and L.D. Hurst, Determinants of the Usage of Splice-Associated cis-Motifs Predict the Distribution of Human Pathogenic SNPs. Molecular Biology and Evolution, 2016. 33(2): p. 518-529.
4. Casey, R.T., et al., SDHA related tumorigenesis: a new case series and literature review for variant interpretation and pathogenicity. Mol Genet Genomic Med, 2017. 5(3): p. 237-250.
5. Hurst, L.D. and N.N. Batada, Depletion of somatic mutations in splicing-associated sequences in cancer genomes. Genome Biology, 2017. 18.
6. Savisaar, R. and L.D. Hurst, Both Maintenance and Avoidance of RNA-Binding Protein Interactions Constrain Coding Sequence Evolution. Molecular Biology and Evolution, 2017. 34(5): p. 1110-1126.
7. Savisaar, R. and L.D. Hurst, Purifying Selection on Exonic Splice Enhancers in Intronless Genes. Molecular Biology and Evolution, 2016. 33(6): p. 1396-1418.
8. Thumann, G., et al., Engineering of PEDF-Expressing Primary Pigment Epithelial Cells by the SB Transposon System Delivered by pFAR4 Plasmids. Mol Ther Nucleic Acids, 2017. 6: p. 302-314.
9. Abrahams, L. and L.D. Hurst, Adenine Enrichment at the Fourth CDS Residue in Bacterial Genes Is Consistent with Error Proofing for +1 Frameshifts. Molecular Biology and Evolution, 2017. 34(12): p. 3064-3080.
10. Wu, X. and L.D. Hurst, Why Selection Might Be Stronger When Populations Are Small: Intron Size and Density Predict within and between-Species Usage of Exonic Splice Associated cis-Motifs. Molecular Biology and Evolution, 2015. 32(7): p. 1847-1861.
11. Liu, H.X., et al., Tetrad analysis in plants and fungi finds large differences in gene conversion rates but no GC bias. Nature Ecology & Evolution, 2018. 2(1): p. 164-173.
12. Abrahams, L. and L.D. Hurst, Refining the Ambush Hypothesis: Evidence That GC- and AT-Rich Bacteria Employ Different Frameshift Defence Strategies. Genome Biology and Evolution, 2018. 10(4): p. 1153-1173.
13. Liu, H.X., et al., Direct Determination of the Mutation Rate in the Bumblebee Reveals Evidence for Weak Recombination-Associated Mutation and an Approximate Rate Constancy in Insects. Molecular Biology and Evolution, 2017. 34(1): p. 119-130.
14. Wang, L., et al., Mutation rate analysis via parent-progeny sequencing of the perennial peach. II. No evidence for recombination-associated mutation. Proceedings of the Royal Society B-Biological Sciences, 2016. 283(1841).
15. Xie, Z.Q., et al., Mutation rate analysis via parent-progeny sequencing of the perennial peach. I. A low rate in woody perennials and a higher mutagenicity in hybrids. Proceedings of the Royal Society B-Biological Sciences, 2016. 283(1841).
16. O'Toole, A.N., L.D. Hurst, and A. McLysaght, Faster Evolving Primate Genes Are More Likely to Duplicate. Molecular Biology and Evolution, 2018. 35(1): p. 107-118.
17. Ghanbarian, A.T. and L.D. Hurst, Neighboring Genes Show Correlated Evolution in Gene Expression. Molecular Biology and Evolution, 2015. 32(7): p. 1748-1766.
18. McLysaght, A. and L.D. Hurst, Open questions in the study of de novo genes: what, how and why. Nature Reviews Genetics, 2016. 17(9): p. 567-578.
19. Wang, J.C., et al., Isolation and cultivation of naive-like human pluripotent stem cells based on HERVH expression. Nature Protocols, 2016. 11(2): p. 327-346.
20. Zadora, J., et al., Disturbed Placental Imprinting in Preeclampsia Leads to Altered Expression of DLX5, a Human-Specific Early Trophoblast Marker. Circulation, 2017. 136(19): p. 1824-1839.
21. Muhlhausen, S., et al., Endogenous Stochastic Decoding of the CUG Codon by Competing Ser- and Leu-tRNAs in Ascoidea asiatica. Current Biology, 2018. 28(13): p. 2046-+.

This project interfaces both fundamental evolutionary genetics and medicine. We have provided the first robust evidence that the correct view of the human genome is that it is bloated owing to weak selection, but in addition that this weak selection has led to more errors and in turn more error mitigation. Thus in contradiction to classical theory, selection - at least for error mitigation - can be stronger when populations are small. These results have a direct societal impact in reforming the notion of human perfection.

We have demonstrated the existence of a species with error prone translation owing to the presence of two tRNAs for the same codon. This breaks the last rule of genetic codes: in this species we cannot predict the proteome just knowing the genome as translation of one codon is stochastic.

As regards applications to medicine, the first application of our novel protocol to design new genes for gene therapy outperformed the commercially available alternative.

Our research into the evolution of error prone gene expression has led to us being able to isolate naive human stem cells and provide an improved growth medium for them.