Final Report Summary - MAIZEKEY (DNA extraction from ancient and modern maize samples and biochemical characterization of genes with key roles in domestication)
The first step consisted of choosing the regions of interest in the genome. Archaeological samples have a very low content of endogenous DNA: many of our samples contain as little as 1% maize DNA, with the rest corresponding to environmental contaminants. The cost of producing the amount of whole-genome data needed for the downstream analysis is therefore prohibitive. We opted instead to increase the depth of coverage around specific regions of the genome using a capture approach. I chose the targets according to four criteria: i) a GO category relevant to disease resistance, tolerance of stressful weather conditions, or nutrient content; ii) sequence identity with sorghum between 70% and 95% (if the sequences were too similar, I expected them to be invariable within maize; if too different, the comparative analysis would be impossible); iii) no hypothetical genes or genes without a description; iv) protein-coding genes only. Around 1 Mb of sequence was captured using MYselect target enrichment kits. Sequencing was done on an Illumina HiSeq, and I designed and tested a pipeline for filtering and mapping the raw data: CutAdapt was used for adapter removal, PRINSEQ for quality trimming, and bwa for mapping reads to the B73 RefGen_v2 reference genome. Only reads mapping to regions with a mappability of 1 (calculated using gem-mappability) were used in the downstream analysis.
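As an illustration only, the four target selection criteria listed above could be expressed as a simple filter. The sketch below is written in Python; the `Gene` record, its field names, and the function names are hypothetical and are not taken from the project's actual pipeline, and the 70-95% identity window is the only value drawn from the text.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical annotation record for one candidate gene."""
    gene_id: str
    description: str          # free-text functional description
    protein_coding: bool      # criterion iv
    sorghum_identity: float   # percent identity with the sorghum sequence
    go_relevant: bool         # True if annotated with a GO category of interest


def is_target(gene, min_identity=70.0, max_identity=95.0):
    """Return True if the gene passes all four selection criteria."""
    desc = gene.description.strip().lower()
    # Criterion iii: drop genes with no description or labelled hypothetical.
    has_description = bool(desc) and "hypothetical" not in desc
    return (
        gene.go_relevant                                            # criterion i
        and min_identity <= gene.sorghum_identity <= max_identity   # criterion ii
        and has_description                                         # criterion iii
        and gene.protein_coding                                     # criterion iv
    )


def select_targets(genes):
    """Return the identifiers of the genes that pass the filter."""
    return [g.gene_id for g in genes if is_target(g)]


# Made-up example records, for demonstration only.
candidates = [
    Gene("gene_a", "disease resistance protein", True, 88.0, True),
    Gene("gene_b", "hypothetical protein", True, 85.0, True),
    Gene("gene_c", "nutrient transporter", True, 99.0, True),
    Gene("gene_d", "", True, 80.0, True),
    Gene("gene_e", "stress response factor", False, 80.0, True),
]
print(select_targets(candidates))
```

In this made-up example only `gene_a` satisfies every criterion; the other records each fail exactly one of them.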
Although the enrichment of the target regions was significant (from an average of 1X to 10X), the overall depth was still too low for confident SNP calls. For this reason, I decided to use a new set of methods that take genotype uncertainty into account instead of basing the analysis on called genotypes, an approach that is especially useful for low- and medium-depth data. Most of these methods have been implemented in the software ANGSD (http://popgen.dk/wiki/index.php/ANGSD) and in ngsAdmix (http://www.popgen.dk/software/index.php/NgsAdmix).
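To make the idea of genotype uncertainty concrete, the following Python sketch computes genotype likelihoods and posterior probabilities at a single biallelic site from the observed read bases. It is a simplified textbook model and not ANGSD's implementation: the independent per-read error model, the Hardy-Weinberg prior, the 1% error rate, and all function names are assumptions made for illustration.

```python
def genotype_likelihoods(bases, error_rates, ref, alt):
    """Return [L(G=0), L(G=1), L(G=2)], where G counts copies of the alt allele.

    Assumes a diploid individual, independent reads, and that a sequencing
    error turns the true base into each of the other three with equal chance.
    """
    likelihoods = []
    for g in (0, 1, 2):
        p_alt_chrom = g / 2.0  # chance that a read comes from an alt chromosome
        like = 1.0
        for base, err in zip(bases, error_rates):
            p_given_ref = 1 - err if base == ref else err / 3
            p_given_alt = 1 - err if base == alt else err / 3
            like *= (1 - p_alt_chrom) * p_given_ref + p_alt_chrom * p_given_alt
        likelihoods.append(like)
    return likelihoods


def genotype_posteriors(likelihoods, alt_freq):
    """Combine the likelihoods with a Hardy-Weinberg prior (an assumption)."""
    q = alt_freq
    prior = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]
    unnorm = [l * p for l, p in zip(likelihoods, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def expected_dosage(posteriors):
    """Expected number of alt alleles, usable in place of a hard genotype call."""
    return posteriors[1] + 2 * posteriors[2]


# One read versus ten reads, all showing the reference base 'A'.
for depth in (1, 10):
    likes = genotype_likelihoods(["A"] * depth, [0.01] * depth, ref="A", alt="G")
    post = genotype_posteriors(likes, alt_freq=0.5)
    print(depth, [round(p, 4) for p in post], round(expected_dosage(post), 4))
```

With a single read, the heterozygous genotype keeps a substantial posterior probability, so a hard call would discard real uncertainty. With ten concordant reads, the posterior concentrates on the homozygous reference genotype.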
Given that population structure can inflate the false positive rate in selection analyses, I started by determining the admixture in the Tularosa samples with ngsAdmix, using the maize HapMap2 data (http://www.panzea.org) as a comparison. I then performed various population genetic analyses to characterize variation within and between the two populations (e.g. Tajima's D and Fst) and to detect genes with outlier behavior that could indicate specific evolutionary constraints associated with domestication. The results of this analysis are being evaluated against an appropriate demographic model, which provides the values expected under a neutral scenario.
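For readers unfamiliar with Tajima's D, the Python sketch below computes the classical statistic from a set of aligned haplotypes. It assumes fully known sequences, whereas the analysis described here relies on estimates that account for genotype uncertainty, so it illustrates the statistic itself rather than the method used in the project. The function name and the toy haplotypes are hypothetical.

```python
import math
from itertools import combinations


def tajimas_d(haplotypes):
    """Classical Tajima's D for equal-length aligned haplotype strings.

    Returns NaN when there are no segregating sites (D is undefined).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    length = len(haplotypes[0])

    # S: number of segregating (polymorphic) sites.
    seg_sites = sum(
        1 for i in range(length) if len({h[i] for h in haplotypes}) > 1
    )
    if seg_sites == 0:
        return float("nan")

    # pi: mean number of pairwise differences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Toy data: only singletons (excess of rare variants).
rare = ["TAAA", "ATAA", "AATA", "AAAT"]
# Toy data: two variants at intermediate frequency.
intermediate = ["AA", "AA", "TT", "TT"]
print(tajimas_d(rare), tajimas_d(intermediate))
```

An excess of rare variants gives a negative D, while variants at intermediate frequency give a positive D. Genes whose values fall in the tails of the distribution expected under the neutral demographic model are the outliers of interest.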
Maize is one of the three principal crops that feed the world. I presented the preliminary results of this work at the “Maize meeting” in Illinois, USA, where around 600 researchers in maize genetics, from both academia and industry, gather to discuss the latest developments in the field. The final results of this work are of great interest to the maize community, as they will shed light on how the earliest steps in maize domestication shaped the maize genome. Furthermore, our data analysis pipeline includes a highly innovative approach to the analysis of next-generation sequencing data. The community showed particular interest in applying this approach to modern samples, since it optimizes resources by allowing a larger number of samples to be analyzed for the same cost.