The aim of this project was to initiate the sequencing of the nuclear genome of the higher plant Arabidopsis thaliana. The advantages of sequencing this plant include its widespread use as a model organism, a small genome size of approximately 100 Mbp, a wealth of genetic and molecular genetic studies aimed at understanding many aspects of plant development and disease resistance, a facile transformation system, and a close phylogenetic relationship with brassicas, an important class of crop plants. The project had six goals. First, YAC contigs covering chromosomes 4 and 5 were to be established, as part of a US-EU collaboration to establish physical maps of all 5 chromosomes. Second, sequence-ready cosmid contigs covering two regions of chromosome 4 and various other regions of biological interest were to be derived, and methods set up to establish sequence-ready libraries covering all of chromosome 4 in preparation for future work. Third, an informatics centre was to be set up, specialising in assembling existing Arabidopsis genome resources, analysing EST and genomic sequence, providing a quality control service, and establishing links for public access to sequence. Fourth, a network of sequencers specialising in EST sequencing was to contribute 3000 unique EST sequences to public databases. Fifth, a network of labs sequencing genomic DNA, both in regions of individual interest and systematically on chromosome 4, was to be set up in order to produce 2.5 Mb of genomic sequence and provide expertise to contribute to future sequencing projects. Finally, cooperative agreements were to be set up to ensure the completion of the genome by 2004.
The objectives of this work were to establish YAC contigs covering the low-copy regions of chromosomes 4 and 5, and to use these contigs to assemble sequence-ready contigs of cosmids and BACs for distribution to sequencing labs. Hybridisation of markers anchored on the genetic map, and the use of YAC end sequences as probes to extend walks, resulted in 4 YAC contigs covering 17.5 Mb of low-copy regions of chromosome 4, and 25 contigs covering 27 Mb of chromosome 5. These contigs covered most of the low-copy regions of both chromosomes, and extended into subtelomeric repeats, into the rDNA cluster on the short arm of chr 4, and into pericentromeric repeats associated with the putative centromeric regions of each chromosome. Most regions were covered by multiple independent clones. Representative YACs were used as hybridisation probes to select cosmid and BAC clones, which were restriction fingerprinted or AFLP fingerprinted and assembled into contigs. All libraries were prepared using DNA isolated from the Columbia ecotype. Clone overlaps were verified by Southern hybridisation. Sequencing was initiated at the FCA and AP2 loci on the 13.5 Mbp lower arm because these regions had relatively extensive and deep YAC coverage at the time the Project began. Guided by the YAC map, a 1.2 Mb cosmid contig at the FCA locus was assembled and restriction mapped. This was achieved by several approaches, including the subcloning of gel-purified YACs into cosmids in order to complete regions not represented in genomic cosmid libraries. Although complete representation of the region was achieved, many of the clones that were sequenced had relatively small inserts compared to the sizes of the vectors. In order to improve the efficiency of sequencing large contiguous regions, Bacterial Artificial Chromosome (BAC) clones became the primary resource for sequencing once they were available.
BAC clones from a 4-fold redundant BAC library were identified by hybridisation to YAC clones and were assembled into contigs. Gaps between the contigs were filled using cosmid subclones of YACs. In the AP2 region a 400 kb contig was assembled from both YAC subclone cosmids and BAC clones. Smaller contigs of lambda clones were assembled by restriction fingerprinting at the SUP region on chr 1, at the EM gene cluster on chr 3, and at the EF1 alpha cluster on chr 3.
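The idea behind assembling contigs from restriction fingerprints can be illustrated with a minimal sketch: two clones that share many restriction fragments of similar size probably overlap. The function name, the 2% size tolerance and the fragment sizes below are all hypothetical and chosen for illustration only; the report does not describe the scoring procedure actually used by the network.

```python
def shared_bands(frags_a, frags_b, tolerance=0.02):
    """Count fragments of clone A that pair with a distinct fragment of clone B.

    Two fragments are treated as the same band when their sizes differ by no
    more than `tolerance` (relative to the larger fragment). The tolerance is
    an illustrative assumption, not a value taken from the project.
    """
    a, b = sorted(frags_a), sorted(frags_b)
    i = j = matches = 0
    # Walk both sorted lists in parallel, pairing bands greedily.
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches


def overlap_score(frags_a, frags_b, tolerance=0.02):
    """Shared bands as a fraction of the bands in the smaller fingerprint."""
    smaller = min(len(frags_a), len(frags_b))
    return shared_bands(frags_a, frags_b, tolerance) / smaller if smaller else 0.0


# Hypothetical fragment sizes (bp) for two clones.
clone_1 = [1000, 2500, 4000, 5200]
clone_2 = [2510, 4000, 7000]
print(shared_bands(clone_1, clone_2), round(overlap_score(clone_1, clone_2), 2))
```

A high score only suggests an overlap; in the project, candidate overlaps were confirmed by Southern hybridisation.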
Quality controls were used at three stages of the sequencing and analysis process. First, sequencers produced sequences on both strands and resequenced potentially inaccurate regions. Second, comparison of overlap sequences produced by different groups served to assess and control differences between labs. Finally, sequence analysis identified frameshifts in potential genes. These regions and inconsistencies in overlap regions were all resequenced independently by the Volckaert group from both cloned and genomic DNA. Based on estimates made in other sequencing projects, the error rate of the sequence was estimated to range from 1 in 5,000 to fewer than 1 in 10,000 bases. The complex CHPR disease resistance locus contained frameshifts and out-of-frame initiation codons, and a 50 kb region of this locus was sequenced independently. Assembly of the FCA region involved comparison of 246,499 bp of overlap sequences in 2,094,637 bp, and the AP2 region contained 57,515 bp overlapping sequence in 454,122 bp.
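The overlap figures quoted above can be turned into proportions with a short calculation. The sketch below assumes that the larger figure for each region is the total sequence compared, including overlaps; the function names are illustrative.

```python
# Figures quoted in the text: (overlap bp, total bp) for each region.
REGIONS = {
    "FCA": (246_499, 2_094_637),
    "AP2": (57_515, 454_122),
}


def overlap_fraction(overlap_bp, total_bp):
    """Fraction of the sequenced bases that lie in overlaps between clones."""
    return overlap_bp / total_bp


def expected_errors(length_bp, errors_per_base):
    """Expected number of errors in a sequence at a given per-base error rate."""
    return length_bp * errors_per_base


for name, (overlap_bp, total_bp) in REGIONS.items():
    print(f"{name}: {overlap_fraction(overlap_bp, total_bp):.1%} of bases in overlaps")

# Error-rate bounds quoted in the text, applied to a 100 kb stretch as an example.
print(expected_errors(100_000, 1 / 5_000), expected_errors(100_000, 1 / 10_000))
```

On these figures roughly 12% of the FCA sequence and 13% of the AP2 sequence was determined independently in overlaps, which is the basis for the comparison between labs.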
In the FCA region sequence was determined from an overlapping contiguous set of 6 BAC clones, 15 Lorist cosmids, 16 CC cosmids, 5 CA t cosmids, and 45 YAC subclone cosmids in the binary vector 04541. Clones were distributed in a network of 17 laboratories for sequencing. The AP2 region sequence was determined from 14 cosmid and 2 BAC clones in two laboratories.
Shotgun sequencing of cosmid and BAC clones was the most common sequencing strategy. In the case of BAC T5D3, which contained a complex of retroelements and 8 similar CHPR resistance genes in the FCA region, each with complex internal repetitive amino acid tracts, the 93 kb region was restriction mapped and subcloned into plasmids. The locations of the subclones on the BAC were determined by direct BAC sequencing, and the sequence of the subclones was determined by primer walking. Gaps in and between contigs were closed by direct primer walking on the cosmid or BAC clone, or by sequencing PCR fragments spanning the gaps from clones or genomic DNA.
EST and cDNA Sequencing
The goal was to sequence 3,300 unique ESTs, many from both the 5' and 3' ends, in an existing network of labs that had initiated sequencing prior to joining this network. EST sequencing was completed in May 1996 with 1,846 unique ESTs and cDNA assemblies produced. This was lower than anticipated due to increased redundancy in the libraries used. Parallel US activities ended at about the same time for the same reason. Together, the EU and US efforts produced 22,458 ESTs which have been assembled into 12,134 unique cDNAs. Cognate cDNAs (68,523 bp) from the FCA region and individual areas were cloned and sequenced. These were invaluable for verifying the quality of gene modelling, which can only rely on experimental data for a small proportion of gene models.
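The redundancy mentioned above can be quantified from the quoted totals. This is a small arithmetic sketch; the function names are illustrative, and the goal of 3,300 unique ESTs is the figure given in this section.

```python
def redundancy(total_reads, unique_sequences):
    """Average number of EST reads per unique cDNA assembly."""
    return total_reads / unique_sequences


def goal_attainment(achieved, goal):
    """Fraction of a numerical target that was reached."""
    return achieved / goal


# Figures quoted in the text.
COMBINED_ESTS = 22_458      # EU and US ESTs together
COMBINED_UNIQUE = 12_134    # unique cDNAs assembled from them
EU_UNIQUE = 1_846           # unique ESTs and cDNA assemblies from this network
EU_GOAL = 3_300             # target stated in this section

print(f"reads per unique cDNA: {redundancy(COMBINED_ESTS, COMBINED_UNIQUE):.2f}")
print(f"share of EU goal reached: {goal_attainment(EU_UNIQUE, EU_GOAL):.0%}")
```

On these totals each unique cDNA was sequenced a little under twice on average, and the network reached a little over half of its target.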
After assembly and preliminary accuracy checks, an initial BLASTX analysis was used to compare all reading frames with all protein sequences and separately with the translations of Arabidopsis and other plant ESTs. A search for tRNAs and repeats was also performed. Genefinder, GeneMark and XGRAIL were modified using published Arabidopsis sequence and used for the identification of Arabidopsis genes. NetPlantGene was used for recognition of splice sites. Where possible the predictions were checked for consistency with known protein sequences or cognate cDNA sequences, and the gene models were manually adjusted accordingly. The predicted protein sequences were extracted using FINDORFS for further analysis. A graphical representation of gene predictions, repeats and other features, and a comparative and structural protein database, is available from the MIPS website. The PEDANT database provides a comprehensive analysis of protein structures and similarities. All of the genomic and cDNA sequences have been lodged with EMBL and submitted for publication.
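Comparing "all reading frames" with protein sequences means translating the genomic sequence in the three forward and three reverse frames before searching. The following is a minimal sketch of that six-frame translation using the standard genetic code; it illustrates the principle only and is not the code of BLASTX or FINDORFS. The function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to their conceptual translations."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Stop codons appear as `*`, so open reading frames can be read off as the stretches between them in each frame.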
With EU industry
The Coordinator and other scientists in the Project attended several meetings with the aim of developing the interests of relevant sectors of EU industry in Arabidopsis sequencing and genomics. A Plant Industrial Platform was established and given access to sequence prior to publication and to scientists working in related fields. The general goal was to demonstrate the efficacy of Arabidopsis sequencing for gene discovery in crop plants.
With the international scientific community
The Coordinator and others attended a series of meetings in the US aimed at convincing colleagues there to join in the sequencing effort. This eventually resulted in the formation of the Arabidopsis Genome Initiative in 1996, which facilitates and coordinates sequencing activities in the US, Japan and the EU.
The EU network was the first to initiate systematic large-scale sequencing of a plant genome, and has convincingly demonstrated the feasibility of this work and the value of the results to the plant science community and relevant industries. The network achieved most of its initial scientific goals, and in many cases sequenced more than originally planned. Where possible, sequencing activity was redirected from individual regions to chromosome 4. In addition, scientists in the Network played a key role in establishing an international collaboration to complete the genome sequence as quickly as possible. Most importantly, many of the important metrics of the Arabidopsis genome have been established from the relatively large contig covering 1.9 Mb of chr 4.
Genes with predicted or known functions were classified into 15 putative cellular roles. This list, based on the yeast functional catalogue, is a preliminary attempt at categorising all plant proteins, and new categories and subcategories will probably be added as more genes are sequenced and analysed. These common established categories will be useful for comparisons of cellular functions between different organisms.
The proportion of genes in each role category is shown in Table 4. Of the 206 genes analysed, the largest number were involved in primary and secondary metabolism (32%), reflecting the complex photoautotrophic metabolism of plants. The 14% of genes involved in disease and defence responses may not be representative of the entire genome because of the cluster of 8 resistance gene isologs at the CHPR locus. The high proportion of genes involved in information processing (transcription, 15%, and signal transduction, 8%) is typical of complex multicellular organisms.
Four highly significant findings have been made in this pilot-scale sequencing project. First, there is a consistently high gene density over an extended contiguous region. The relatively high gene density encountered in this region is also found in other sequenced regions where annotation is available. A more refined approximation of the total gene number as 21,000 is now possible, based on the 10 Mb of available sequence from 4 chromosomes, and the size of YAC contigs covering most of the low-copy regions of the five chromosomes. Second, the genome sequence has a high information content; 53% of the predicted genes can be assigned cellular roles based on enzymatic, structural or other functions derived from sequence similarity to proteins of known function. Nevertheless, the specific functions of many of these genes in plants, in development and environmental adaptation for example, require further biological analysis. The remaining 47% of predicted genes, which have no significant similarities to other genes, require both more sophisticated computer analysis and extensive systematic biological experimentation to determine their function. Third, nearly 20% of the predicted genes are members of gene families that may have arisen by gene duplication and divergence. This feature is not as pronounced in other sequenced eukaryotes, such as C. elegans and yeast. If the number of gene families in Arabidopsis is found to be approximately 15,000 after more comprehensive sequencing, Arabidopsis will have a similar-sized gene complement to the model metazoans Drosophila and C. elegans. This may represent a minimal number of genes required for the function of complex multicellular organisms with highly diverged mechanisms of development and environmental interactions. Finally, it is now possible to predict that a straightforward shotgun sequencing strategy can generate contiguous sequence from nearly all of the low-copy regions of the Arabidopsis genome.
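A rough consistency check on the gene-number estimate can be made from figures quoted in this report. The sketch below rests on assumptions that the report does not state directly: that the 206 genes assigned to role categories are the 53% of predicted genes with assignable roles, that they all lie in the 1.9 Mb contig on chr 4, and that the gene density of that contig applies to a genome of about 100 Mbp. It is not the method used to derive the figure of 21,000, which was based on 10 Mb of sequence and the YAC contig sizes.

```python
# Figures quoted in the report.
GENES_WITH_ROLES = 206        # genes assigned to cellular role categories
FRACTION_WITH_ROLES = 0.53    # share of predicted genes with assignable roles
CONTIG_BP = 1_900_000         # contig on chr 4
GENOME_BP = 100_000_000       # approximate genome size


def predicted_gene_count(genes_with_roles, fraction_with_roles):
    """Total predicted genes implied by the classified subset (assumption)."""
    return genes_with_roles / fraction_with_roles


def gene_spacing_bp(region_bp, gene_count):
    """Average number of base pairs per gene in a region."""
    return region_bp / gene_count


def extrapolated_gene_number(genome_bp, spacing_bp):
    """Gene number if the regional density held across the whole genome."""
    return genome_bp / spacing_bp


genes_in_contig = predicted_gene_count(GENES_WITH_ROLES, FRACTION_WITH_ROLES)
spacing = gene_spacing_bp(CONTIG_BP, genes_in_contig)
print(round(genes_in_contig), round(spacing), round(extrapolated_gene_number(GENOME_BP, spacing)))
```

Under these assumptions the contig holds a little under 400 predicted genes, or about one gene per 5 kb, and the extrapolation gives a figure of the same order as the 21,000 quoted above. Repeat-rich regions with lower gene density would reduce it.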
In addition to these general conclusions, a variety of interesting genes have been sequenced in their precise chromosomal locations. For example, the sequence of a putative disease resistance locus will provide insight into the mechanisms generating the diversity of resistance genes with different pathogen specificities. The identification of genes encoding the three classes of terpenoid cyclase enzymes provides a foundation for a molecular genetic analysis of this complex and commercially important pathway. The sequence of the nuclear genome will permit the complex interplay between the three genomes of plant cells to be understood in far greater detail than hitherto possible, as the mitochondrial and chloroplast genomes have been completely sequenced.
The identification of all the protein-coding genes and other non-genic features in their precise locations in a plant genome has two general consequences. First, it provides a framework for the systematic identification of the functions of plant genes by screening for T-DNA and transposon insertions. Present resources, established exclusively in the framework of EC-funded networks, have the potential to create disruptions in most of the genes in this region. Second, the gene order will help to define regions of conserved gene order in crop plant species for the important purpose of identifying orthologous genes.
Funding Scheme: CSC - Cost-sharing contracts