Determination of the entire nucleotide sequence of S. cerevisiae chromosomes II (840 kb) and XI (680 kb) by the end of 1993. Identification, through this systematic sequencing exercise, of over 500 previously undescribed genes.
Two yeast chromosomes, II and XI, are being sequenced. To date, 200 kb of chromosome II and 290 kb of chromosome XI have been sequenced, the total lengths comprising 620 and 830 kb, respectively. Preliminary sequence data provided information about clone orientation and overlapping regions, and helped in the rapid identification of known genes. A detailed plot was constructed that displays the clones used in sequencing, all open reading frames (ORFs), and previously mapped genes and other elements.
A software tool that greatly facilitates data handling has been developed. Using an X-windows interface, the program allows the rapid display of chromosome clones as well as restriction sites and ORFs. It can zoom in on a selected region, handle all available restriction enzymes, and mark sequenced fragments within clones. Two databases have been created and are maintained for the yeast project. For the yeast protein sequence database (Yeast Prot), all entries already existing in databases were checked for errors by comparison with the nucleic acid sequence, where this was available. Errors were corrected accordingly and partial sequences were merged to eliminate redundant information. The current release includes 978 entries comprising 431,867 residues. The yeast nucleic acid sequence database (Yeast Nuc) is a merged dataset, i.e. it was assembled from yeast sequences in several databases, eliminating entries that contain identical sequences. However, entries differing in at least one nucleotide are retained in Yeast Nuc. This has the advantage of providing an up-to-date dataset for fast searches, at the cost of partially redundant information and entry format diversity (each Yeast Nuc entry inherits the format of the parent database). The current release includes 1889 entries comprising 3,261,167 bases. An online computing facility was established that enables the participating laboratories to analyze their data and perform database queries.
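The merging rule described for Yeast Nuc (drop entries whose sequence is identical to one already present, keep entries that differ in at least one nucleotide) can be illustrated with a minimal sketch. The function name `merge_entries` and the `(entry_id, sequence)` record layout are assumptions made for illustration; the report does not describe the actual implementation or entry formats.

```python
def merge_entries(*databases):
    """Merge (entry_id, sequence) records from several source databases.

    Hypothetical record layout. An entry is dropped only when its sequence
    is identical (ignoring letter case) to one already kept; entries that
    differ by even a single nucleotide are retained.
    """
    seen = set()
    merged = []
    for db in databases:
        for entry_id, seq in db:
            key = seq.upper()
            if key in seen:
                continue  # identical sequence already present
            seen.add(key)
            merged.append((entry_id, seq))
    return merged


if __name__ == "__main__":
    db_a = [("A1", "ACGT"), ("A2", "GGCC")]
    db_b = [("B1", "acgt"), ("B2", "ACGA")]
    print(merge_entries(db_a, db_b))
```

Because the first occurrence wins, the order in which the source databases are passed decides which parent format a surviving entry inherits.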
MIPS, acting as the informatics coordinator of the European Yeast Genome Sequencing Project, is responsible for the collection, storage and analysis of all sequence data submitted by the laboratories involved. Submissions arrive by e-mail or on magnetic media. Preliminary sequence data are also sent to MIPS; they provide valuable information about clone orientation and overlapping regions and help in rapidly identifying known genes. The processing scheme employed is outlined below.
Construction of a restriction map based on the sequence. This map is compared to the map of the clone library to verify clone orientation.
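As an illustration of deriving a restriction map from a sequence, the following minimal sketch locates recognition sites and computes the resulting fragment sizes, which could then be compared with the sizes in the clone library map. The function names and the three-enzyme table are assumptions; the recognition sequences themselves are the standard ones for EcoRI, BamHI and HindIII. Positions are reported at the start of the recognition site rather than at the exact cut position.

```python
import re

# Small illustrative enzyme table (standard recognition sequences).
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def restriction_map(seq, enzymes=ENZYMES):
    """Return {enzyme: [0-based start positions of recognition sites]}."""
    seq = seq.upper()
    sites = {}
    for name, site in enzymes.items():
        # A lookahead is used so that overlapping sites are also reported.
        sites[name] = [m.start() for m in re.finditer(f"(?={site})", seq)]
    return sites


def fragment_lengths(seq, positions):
    """Fragment sizes obtained by splitting seq at the given positions."""
    cuts = [0] + sorted(positions) + [len(seq)]
    return [b - a for a, b in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    clone = "AAGAATTCGGGGATCCTT"
    rmap = restriction_map(clone)
    print(rmap)
    print(fragment_lengths(clone, rmap["EcoRI"] + rmap["BamHI"]))
```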
The sequence is used to check for overlaps:
with vectors (using the VecBase vector database);
with known yeast sequences;
with other clones of the same chromosome.
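The overlap checks listed above can be sketched in simplified form. The report does not state which comparison method was used, so the example below is only an assumption: it uses exact string matching, whereas a production pipeline would need an alignment method that tolerates sequencing errors. `shared_kmers` flags regions of a submission that also occur in a reference such as a vector sequence, and `suffix_prefix_overlap` finds the end-to-start overlap between two clones. Both function names are hypothetical.

```python
def shared_kmers(seq, ref, k=12):
    """Start positions in seq of k-mers that also occur in ref (exact match)."""
    ref_kmers = {ref[i:i + k] for i in range(len(ref) - k + 1)}
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in ref_kmers]


def suffix_prefix_overlap(a, b, min_len=20):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


if __name__ == "__main__":
    # Toy sequences; the small k and min_len are chosen only for the demo.
    print(shared_kmers("TTTTACGTACGTTT", "GGACGTACGGG", k=6))
    print(suffix_prefix_overlap("AAACCCGGG", "CCGGGTTT", min_len=3))
```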
A pattern search is then performed at the DNA level in order to detect promoters, upstream activating sequences, autonomously replicating sequences, introns, tRNA genes, other yeast-specific regulatory sequences, and repeats.
Extraction of open reading frames (ORFs). The locations of these ORFs are correlated with the positions of promoters and terminators in order to assess the probability of expression.
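A minimal sketch of ORF extraction in all six reading frames is given below. It assumes an ORF runs from an ATG codon to the next in-frame stop codon and applies a minimum length of 100 codons; both the definition and the threshold are assumptions for illustration, since the report does not state the criteria used. The correlation with promoter and terminator positions is not covered here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, start, end) tuples for ATG...stop ORFs.

    Coordinates are 0-based and end-exclusive, and refer to the sequence of
    the strand on which the ORF was found ('-' means the reverse complement).
    min_codons counts the codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG after the previous stop
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs


if __name__ == "__main__":
    # Toy example with a low threshold so that the short ORF is reported.
    print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))
```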
For the analysis of ORFs, the following steps are employed.
FASTA rapid sequence comparison against all known protein sequences available (approx. 40,000). This step is essential not only for detecting similarities of ORFs to other sequences in the databases, but also for reidentifying known genes already detected at the DNA level, thus excluding frameshift errors. For FASTA scores that do not reflect unambiguous similarities, more sensitive comparison methods are applied.
A pattern search is then performed at the protein level using the ProSite Dictionary of Protein Sites and Patterns.
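To illustrate how a ProSite-style pattern can be matched against a protein sequence, the sketch below translates the basic pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for terminal anchors) into a regular expression. The function names are hypothetical, less common syntax variants are not handled, and this is not a description of the software actually used in the project. The example pattern is the N-glycosylation site pattern, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic ProSite pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Excluded residues first, then repeat counts, then the wildcard.
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        parts.append(prefix + element + suffix)
    return "".join(parts)


def scan(protein, pattern):
    """0-based start positions of all (possibly overlapping) matches."""
    regex = prosite_to_regex(pattern)
    return [m.start() for m in re.finditer(f"(?=({regex}))", protein)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(scan("MKNASAANPTQ", "N-{P}-[ST]-{P}."))
```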
The ORFs are scanned for internal repeats; a search for putative transmembrane segments is performed if the methods outlined above indicate a membrane protein.
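The report does not say which method was used to search for transmembrane segments. One widely used approach, shown here purely as an assumed stand-in, is a sliding-window hydropathy average with the Kyte-Doolittle scale, where windows of 19 residues averaging above 1.6 are taken as candidate membrane-spanning regions. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter code).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(protein, window=19):
    """Mean hydropathy of every window of the given length.

    Unknown residue codes are scored as 0.0 (an assumption of this sketch).
    """
    scores = [KD.get(aa, 0.0) for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def putative_tm_windows(protein, window=19, threshold=1.6):
    """0-based start positions of windows whose mean exceeds the threshold."""
    averages = hydropathy_windows(protein, window)
    return [i for i, avg in enumerate(averages) if avg > threshold]


if __name__ == "__main__":
    toy = "DDDDD" + "L" * 19 + "KKKKK"  # artificial hydrophobic stretch
    print(putative_tm_windows(toy))
```

Adjacent window positions above the threshold would normally be merged into a single candidate segment before reporting.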
The chromosome contig is gradually assembled based on the clone overlap data. The locations of previously mapped genes are correlated to the existing physical map of the chromosome. The new physical map is compared to the genetic map.
Furthermore, yeast-specific sequence databases have been created and are being maintained for the yeast project.
Finally, an online computing facility was established that enables participating laboratories to analyze their data and perform database queries. Two mail servers for querying all available sequence databases are open for public use.