


Summary description of project objectives
Of ~500 anopheline mosquito species, only about two dozen transmit human malaria, and vectorial capacity varies greatly even among very closely related species. Understanding what defines an effective malaria vector is therefore critical to developing successful controls. The complete genome sequencing of multiple Anopheles mosquito species, coordinated by the Anopheles Genomes Cluster Committee, facilitates the investigation of the genetic basis underlying these phenotypic variations and other characteristics such as insecticide resistance and chemosensation. This project aims to develop and employ computational strategies to interrogate multiple mosquito genomes for patterns of natural selection shaping the repertoire of functional genomic elements governing mosquito biology. The specific objectives can be summarised as three major goals over the course of the project: 1) conservation analysis to identify functional genomic elements; 2) divergence analysis to study gene and genome evolution; and 3) functional analysis to validate and characterise novel biological hypotheses.

Description of work performed since beginning of project
Recognisable biological functions are encoded by interactions among a variety of factors including protein-coding genes, non-protein-coding RNAs, and conserved non-coding elements. The discovery of these interactions relies on thorough characterisation of evolutionary patterns and comprehensive probing of biological functions. These in turn rely on primary analyses that focus on accurate identification: developing computational tools to recognise characteristic features and predict the full complement of functional genomic elements. The primary analyses performed since the beginning of the project have therefore focused on building multiple whole genome alignments and employing conservation analyses to identify functional genomic elements. A prerequisite for these analyses was to assess the quality of the new genome assemblies, to highlight any potential errors that might need correcting before progressing with the alignments. Following the quality control steps, alignment methods were selected that take into account the genome sizes and the evolutionary distances between the mosquito species concerned, as well as the practicalities of running the computational analyses. The multiple whole genome alignments built for each species provided the principal input data for comprehensive evolutionary signature analyses. These analyses scan the alignments for patterns of sequence change that highlight regions with significant protein-coding potential, as well as constrained genomic regions that may harbour non-coding genes or regulatory elements. The work was approached in a stepwise manner as each set of new genome assemblies was released by the sequencing centre at the Broad Institute and subsequently made available with their gene annotations at VectorBase, the bioinformatics resource for invertebrate vectors of human pathogens.

Description of the main results achieved so far
Genome assembly assessments required the development of approaches to assess the quality of genome assemblies in terms of a) relative completeness, using universal single-copy orthologs, and b) relative contiguity, using conserved blocks of syntenic orthologs. These complementary genome assembly assessment approaches proved very popular with researchers working on other insect genomes when I presented them at the Arthropod Genomics Symposium at the University of Notre Dame, Indiana, in June 2013. A preliminary analysis toolkit has been made available to the community, with the aim of developing it into a published resource. Building whole genome alignments required the selection and implementation of approaches to achieve whole genome alignments across the Anopheles phylogeny, as well as tools for their analysis and visualisation. A very important pre-alignment step was to mask the repetitive regions of each genome, because the many matches that occur between repetitive DNA cause the pairwise alignments to take a prohibitively long time to compute. Starting with the first available sets of new mosquito genomes, and repeating as new genome assemblies became available, I iteratively improved the pipeline for building multiple whole genome alignments to produce a reference alignment for each species. The first round included 11 Anopheles genome assemblies: the total aligned basepairs per species ranged from ~100Mbp for the most distantly related species to ~200Mbp for the most closely related species. The second round of alignments included a total of 14 assemblies, and with the release of the last 7 genome assemblies in November 2013, 21-way alignments were built. The total aligned proportion of each assembly ranges from 55% to 82%, with members of the closely related gambiae complex averaging 73%, and on average 14Mbp from each assembly can be aligned across all 21 assemblies.
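The relative-completeness assessment described above can be illustrated with a minimal sketch. It assumes that each predicted gene in an assembly has already been assigned to an ortholog group, and that a list of groups expected to be universal and single-copy is available. The function name `completeness_summary`, the group identifiers, and the input layout are hypothetical choices for illustration and are not taken from the project's toolkit.

```python
from collections import Counter


def completeness_summary(expected_groups, gene_to_group):
    """Summarise how many expected single-copy ortholog groups an assembly contains.

    expected_groups: iterable of ortholog-group identifiers expected to be
        present exactly once in every species (hypothetical input format).
    gene_to_group: dict mapping each gene identifier in the assembly to the
        ortholog group it was assigned to.
    """
    expected = set(expected_groups)
    # Count how many genes in this assembly fall into each expected group.
    copies = Counter(g for g in gene_to_group.values() if g in expected)

    single = sum(1 for g in expected if copies[g] == 1)
    duplicated = sum(1 for g in expected if copies[g] > 1)
    missing = len(expected) - single - duplicated
    total = len(expected)

    return {
        "single_copy": single,
        "duplicated": duplicated,
        "missing": missing,
        # Groups found at least once, as a percentage of all expected groups.
        "percent_complete": 100.0 * (single + duplicated) / total if total else 0.0,
    }


if __name__ == "__main__":
    # Toy data: four expected groups, one found once, one found twice.
    expected = ["OG1", "OG2", "OG3", "OG4"]
    genes = {"g1": "OG1", "g2": "OG2", "g3": "OG2", "g4": "OG9"}
    print(completeness_summary(expected, genes))
```

Comparing these summaries across assemblies gives a relative measure: an assembly missing many groups that are found in its close relatives is a candidate for closer inspection before alignment.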
As each round of whole genome multiple sequence alignments was completed, I made the results available to all members of the Anopheles Genomes Cluster Consortium. Specifically, I created a web-based tool to look up genes of interest and view the corresponding region of the alignments. The tool presents conservation levels by highlighting conservative, synonymous, and radical, non-synonymous substitutions relative to the reference sequence, as well as gaps and frameshifts. Additionally, I developed a computational pipeline to scan the whole genome alignments in all six reading frames to identify regions with significant protein-coding potential that could correspond to coding exons of annotated as well as novel genes. In complementary analyses, I focused on quantifying evolutionary constraint across the alignments to identify additional functional elements. Significant constraint over a genomic region suggests that it is functionally important; if the region does not also exhibit significant protein-coding potential, then it is likely to be either a non-coding RNA gene or a regulatory feature such as a transcription factor binding site. Although the first stage of the outgoing phase focused on objective 1 (conservation analysis), considerable progress has also been made towards the aims that underlie objective 2 (divergence analysis). This involved subjecting the predicted functional elements to evolutionary characterisation in terms of their relationships both within and between genomes. This included the delineation of gene ancestries and family classifications, where reliable correspondences facilitate both the investigation of the evolutionary dynamics of gene families and genome architectures and confident functional inferences.
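The idea behind scanning alignments in six reading frames can be sketched as follows: in a true coding frame, substitutions relative to the reference tend to be synonymous, whereas in the wrong frame they more often change the encoded amino acid. This is a simplified illustration, not the project's pipeline. It assumes a pairwise slice of an alignment given as two equal-length uppercase strings with `-` for gaps, uses the standard genetic code, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT-", "TGCA-")


def reverse_complement(seq):
    """Reverse-complement an aligned sequence, keeping gap characters."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_codons(ref, other, offset):
    """Compare aligned codons of `other` against `ref` starting at `offset`."""
    counts = {"identical": 0, "synonymous": 0, "nonsynonymous": 0, "skipped": 0}
    for i in range(offset, len(ref) - 2, 3):
        r, o = ref[i:i + 3], other[i:i + 3]
        if r not in CODON_TABLE or o not in CODON_TABLE:
            counts["skipped"] += 1        # gap or ambiguous base in the codon
        elif r == o:
            counts["identical"] += 1
        elif CODON_TABLE[r] == CODON_TABLE[o]:
            counts["synonymous"] += 1     # same amino acid, different codon
        else:
            counts["nonsynonymous"] += 1  # amino acid (or stop) changed
    return counts


def scan_six_frames(ref, other):
    """Classify codon substitutions in three forward and three reverse frames."""
    strands = {
        "+": (ref, other),
        "-": (reverse_complement(ref), reverse_complement(other)),
    }
    return {
        f"{strand}{offset}": classify_codons(r, o, offset)
        for strand, (r, o) in strands.items()
        for offset in range(3)
    }


if __name__ == "__main__":
    reference = "ATGGCTGAAAAA"
    aligned = "ATGGCCGAAAAG"
    for frame, counts in scan_six_frames(reference, aligned).items():
        print(frame, counts)
```

In this toy example both substitutions are synonymous in frame `+0` only, which is the kind of signal that distinguishes a plausible coding frame. A real analysis would use many species, a phylogeny-aware model, and statistical testing rather than raw counts.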

Expected final results and their potential impact and use
Limiting the damaging effects of insects on human health and agriculture has traditionally involved pesticide-based controls, often with variable and declining success. Novel approaches require detailed biological understanding to facilitate targeted interventions while minimising ecological knock-on effects. The results from the first stage already show how this project directly aids the translation of a wealth of genomic data into improved biological understanding, e.g. conservation-informed improvements to automatic gene predictions and evolutionarily informed inferences of gene functions. By employing comparative evolutionary genomic approaches, followed by functional validation and exploration with a focus on the underlying genetic basis of vector traits, this project aims to reveal novel mosquito biology that will significantly advance disease control strategies and contribute to the development of innovative approaches to tackle global health challenges.