
Modelling the genomic landscapes of selection and speciation

Periodic Reporting for period 2 - ModelGenomLand (Modelling the genomic landscapes of selection and speciation)

Reporting period: 2019-08-01 to 2021-01-31

Understanding how natural selection, random genetic drift and demographic events interact to generate and maintain genetic and species diversity has been the central focus of population genetics for many decades. We now have the necessary genome sequence data to make detailed and powerful inferences about the evolutionary past of populations and species, yet our ability to meaningfully interpret such data has remained fundamentally limited by two gaps: a lack of powerful and efficient statistical methods for extracting information about the evolutionary process from sequence data, and a lack of meaningful and comprehensive comparative analyses of natural speciation processes.

This project combines theory, the development of new inference tools and large-scale comparative analyses of genome data, and has two principal aims:

1) to develop a general, statistical framework for making inferences about the joint action of past selection and demography from genome sequence data. This will be achieved using analytic calculations and approximations for the joint distribution of linked polymorphic sites. We will use these results to develop new methods to quantify the genome-wide rates of positive and background selection and to scan for genomic outliers of divergence between and positive selection within species. The new methods will be tested using simulations and data from model insects (Heliconius and Drosophila).

2) to apply the new inference approach to genome data for 20 species pairs of European butterflies and conduct a systematic comparison of the demographic and selective forces involved in speciation. This will reveal how repeatable speciation processes are both in terms of the demographic and selective events, and the genes and genomic architectures involved. Specifically, we will test whether selection during speciation is concentrated at chromosomal rearrangements and/or candidate gene families involved in mate recognition and host plant adaptation.

This project seeks to fundamentally improve both our understanding of speciation and selection and our ability to use sequence data to study population processes (be they selection, demography or both) in any system.
The ERC core team has made good progress on all three work packages so far:

WP1: We have developed an open-source tool for demographically explicit genome scans for reproductive barriers and have conducted a comprehensive example analysis on a well-studied dataset from Heliconius butterflies. This was made possible by several new insights into the structure of the underlying likelihood calculation. The new method for genome IM inference using blockwise likelihood estimation (gIMble) was presented by the ERC core team (with external team member Kelleher, Oxford) during a hands-on demonstration at an SMBE satellite workshop on speciation genomics (~60 participants, Sweden, 05/2019). An ERC-funded PhD student has developed a simulation module that integrates msprime, a state-of-the-art coalescent simulator, to conduct power analyses and parametric bootstraps.

WP2: We have implemented two coalescent approximations for selective sweeps in the generating function framework, the star-like approximation and the Yule approximation, and have quantified (using forward simulations) how well each approximates the distribution of genealogical branch lengths around a selective sweep target. Ongoing work on this project is quantifying the power of likelihood inference on genomic data to detect historic sweeps (either from de novo mutations or from introgressed beneficial variants) based on these analytic results. We also completed a simulation study showing how globally beneficial mutations can interfere with and slow down the process of local adaptation (led by a PhD student; published in Proc Roy Soc B, 2020).

WP3: We have completed sample acquisition and the generation of PacBio-based assemblies for 20 species pairs of European butterflies. Resequence data have been generated for 6 species pairs to date, and the submission of extractions for the remaining 14 species pairs is scheduled for fall 2020. Sampling was achieved during a two-week field trip to Romania and Hungary and two shorter trips to Spain (2018) and France (2019). Additional samples were obtained from external team members across Europe. We developed protocols for high molecular weight DNA extractions that maximise both molecule length and yield from individual butterflies and generated extractions for all focal species. We have implemented bioinformatic pipelines for genome assembly, polishing and annotation on a pilot dataset for four species. The assembly of reference genomes for all 20 pairs is ahead of schedule, i.e. initial assemblies have been generated for 18 species pairs. We have leveraged the RNA-seq data for 18 butterfly species pairs to conduct a first comparative project on speciation history. A surprising and encouraging result of this study (led by a PhD student, in review) is that the majority of sister species of European butterflies began to diverge in the Pliocene, and so are substantially older than previous studies based on mitochondrial data suggested. This will be a useful reference point for future comparative work on this model group.
The demographically explicit genome scanning tool for barriers to gene flow (WP1) is scheduled to be submitted for publication at the beginning of the second reporting period (i.e. end of 2020). Although we plan to develop and implement some natural extensions of this inference/analysis scheme (e.g. allowing for phased data and a wider range of demographic scenarios), this would complete WP1.
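
As a rough illustration of blockwise likelihood estimation, the sketch below computes a composite log-likelihood across short sequence blocks. It deliberately swaps the IM model for the simplest possible case, two lineages sampled from a single panmictic population, where the number of differences k in a non-recombining block follows a geometric distribution, P(k) = θ^k / (1 + θ)^(k+1), with θ the scaled mutation rate per block. The function names, the toy block counts and the grid search are assumptions made for illustration and are not taken from gIMble.

```python
import math
from collections import Counter


def block_probability(k, theta):
    """P(k differences in a block) for two lineages in one panmictic
    population (a stand-in model, not the IM model used in the project)."""
    return theta ** k / (1.0 + theta) ** (k + 1)


def composite_log_likelihood(block_counts, theta):
    """Sum of log-probabilities over blocks, treating blocks as independent."""
    tally = Counter(block_counts)
    return sum(n * math.log(block_probability(k, theta))
               for k, n in tally.items())


def grid_search_theta(block_counts, grid):
    """Return the grid value of theta with the highest composite likelihood."""
    return max(grid, key=lambda theta: composite_log_likelihood(block_counts, theta))


# Hypothetical per-block difference counts, for illustration only.
toy_blocks = [0, 1, 0, 2, 1, 0, 3, 1, 0, 2]
grid = [i / 100 for i in range(1, 301)]
theta_hat = grid_search_theta(toy_blocks, grid)
print(theta_hat)
```

For this toy model the maximum sits at the mean number of differences per block. A demographically explicit scan would replace `block_probability` with probabilities computed under a model of divergence and gene flow.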

The analytic work on selective sweeps is at an advanced manuscript stage and we are aiming to submit it by the end of 2020. WP2 will involve substantial further computational work to link these analytic results to sequence data. We will focus on developing methods for the joint inference of demography (divergence and gene flow) and the positive selection processes that are embedded in such non-equilibrium histories. In particular, we will focus on detecting introgression sweeps and selective sweeps that occur at the onset of divergence between species/populations; the latter would link WP2 to the scanning method developed in WP1. We will explore these approaches using simulation in the first instance and conduct a comparison with existing SNP-based sweep scans.
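
To make the star-like approximation concrete, the sketch below uses one textbook form of it, which is an assumption here and not necessarily the parameterisation used in the project. Each lineage sampled at recombination distance r from the sweep target independently stays linked to the beneficial background with probability p = exp(-(r/s) ln(2N)), and all lineages that stay linked coalesce at the time of the sweep. The function names and parameter values are hypothetical.

```python
import math


def linked_probability(r, s, N):
    """Probability that a lineage does not recombine off the sweeping
    background (assumed form: exp(-(r/s) * ln(2N)))."""
    return math.exp(-(r / s) * math.log(2 * N))


def caught_lineage_distribution(n, r, s, N):
    """Binomial distribution of how many of n sampled lineages are caught
    by the sweep, assuming lineages escape independently."""
    p = linked_probability(r, s, N)
    return [math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(n + 1)]


def pair_coalescence_probability(r, s, N):
    """Probability that two lineages are both caught and so coalesce at the sweep."""
    return linked_probability(r, s, N) ** 2


# Hypothetical values: selection coefficient, population size and
# recombination distances from the sweep target.
s, N = 0.01, 10_000
for r in (0.0, 1e-4, 1e-3):
    print(r, pair_coalescence_probability(r, s, N))
```

Under this assumed form the effect of a sweep on genealogies decays with distance from the target at a rate set by r/s, which is what makes the local distribution of branch lengths informative about past sweeps.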

Further empirical work will involve a series of papers on the speciation history (both demographic and in terms of selection) of target sister species pairs of butterflies. We have made substantial progress in analysing genome-wide variation in four pilot species pairs. Draft genome assemblies will be made available as genome notes upon completion. Completing WP3 will involve several comparative analyses that explore how repeatable speciation is at the level of the genome by comparing barrier scans across replicate sister species pairs. We will also explore the intersection of changes in gene expression (and sex bias in such expression) and selection during speciation, and will investigate the potential role of chromosomal fusions/fissions and inversions in the speciation process. We envisage that empirical projects on individual species pairs will be completed in 2021 and comparative analyses in the last two years of the ERC project.
Several butterfly sister taxa form contact zones - natural laboratories for speciation research
The genealogical history of a sample of genomes can be represented as a graph