New tool to improve protein database quality control emerges

Public databases continue to contain abnormal, incomplete and mispredicted genes and proteins in spite of recent efforts to improve the computational annotation of genomes. These errors undermine the reliability of the databases. But a team of researchers has developed a new tool, called 'MisPred', that offers an efficient means of database quality control. The work, recently published in the open access journal BMC Bioinformatics, was carried out as part of the EU project BioSapiens, supported with EUR 12 million in funding under the thematic area 'Life sciences, genomics and biotechnology for health' of the Sixth Framework Programme (FP6).

The MisPred tool, said the research team, uses five 'routines' to identify suspect abnormal, incomplete or mispredicted entries. The rationale is that a sequence is probably incorrect if some of its features conflict with existing knowledge about protein-coding genes and proteins: (1) extracellular or transmembrane proteins must have appropriate secretory signals; (2) a transmembrane segment must be found on a protein with intra- and extra-cellular parts; (3) extracellular and nuclear domains must not occur in a single protein; (4) the number of amino acid residues in closely related members of a globular domain family must fall into a relatively narrow range; and (5) a protein must be encoded by exons located on a single chromosome. The team was led by Professor László Patthy, a researcher at the Institute of Enzymology of the Hungarian Academy of Sciences.
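To make the five rules above more concrete, here is a minimal Python sketch of how such consistency checks might be expressed. It is not the MisPred implementation: the `ProteinRecord` class, its fields and the `suspect_rules` function are hypothetical names invented for illustration, and the sketch assumes that features such as signal peptides, domains and exon locations have already been annotated by other tools.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinRecord:
    """Hypothetical, pre-annotated summary of one database entry."""
    has_secretory_signal: bool = False
    has_extracellular_part: bool = False
    has_intracellular_part: bool = False
    has_nuclear_domain: bool = False
    transmembrane_segments: int = 0
    # Each item: (observed length, minimum expected, maximum expected)
    # for one globular domain; the expected range is an assumed input.
    domain_lengths: list = field(default_factory=list)
    # Chromosomes on which the encoding exons are located.
    exon_chromosomes: set = field(default_factory=set)


def suspect_rules(rec: ProteinRecord) -> list:
    """Return the numbers (1-5) of the rules that the record conflicts with."""
    violated = []

    # Rule 1: extracellular or transmembrane proteins need a secretory signal.
    needs_signal = rec.has_extracellular_part or rec.transmembrane_segments > 0
    if needs_signal and not rec.has_secretory_signal:
        violated.append(1)

    # Rule 2: a transmembrane segment implies intra- and extra-cellular parts.
    if rec.transmembrane_segments > 0 and not (
        rec.has_extracellular_part and rec.has_intracellular_part
    ):
        violated.append(2)

    # Rule 3: extracellular and nuclear domains must not co-occur.
    if rec.has_extracellular_part and rec.has_nuclear_domain:
        violated.append(3)

    # Rule 4: each domain length must fall inside its family's expected range.
    if any(not (low <= length <= high) for length, low, high in rec.domain_lengths):
        violated.append(4)

    # Rule 5: all exons must lie on a single chromosome.
    if len(rec.exon_chromosomes) > 1:
        violated.append(5)

    return violated


# Example: an extracellular protein with no signal peptide whose exons
# map to two chromosomes conflicts with rules 1 and 5.
example = ProteinRecord(
    has_extracellular_part=True,
    exon_chromosomes={"chr1", "chr7"},
)
print(suspect_rules(example))
```

A record flagged this way is only a suspect. As the researchers themselves note, genuine exceptions such as leaderless secretion exist, so a flag would call for closer inspection rather than automatic removal.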
'Recent studies have shown that a significant proportion of eukaryotic genes are mispredicted at the transcript level,' said Professor Patthy. 'As the MisPred routines are able to detect many of these errors, and may aid in their correction, we suggest that it may significantly improve the quality of protein sequence data based on gene predictions.'

Professor Patthy pointed out, however, that a number of secreted proteins may actually 'lack secretory signal peptides because they are subject to leaderless protein secretion'. He added, 'Similarly, it cannot be excluded at present that transchromosomal chimeras can be formed and may have normal physiological functions. Nevertheless, the fact that MisPred analyses of protein sequences of the Swiss-Prot database identified very few such exceptions indicates that the rules of MisPred are generally valid.'

The research showed that most mispredictions stem from the absence of expected signal peptides and from violations of domain integrity. 'Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON predicted entries,' the team said. According to the team, the MisPred approach will give researchers more time to study erroneously identified genes.