Noncoding RNA Comparative Searching System

Final Report Summary - NARCISUS (Noncoding RNA comparative searching system)

Structural genomics is a broad term for the process of determining the structural representation of the information in the human genome; at present it is limited almost exclusively to proteins. Although genetic information is commonly understood to mean 'genes and their encoded protein products', thousands of human genes produce transcripts that are biologically important but do not necessarily produce proteins. Furthermore, even though the sequence of human deoxyribonucleic acid (DNA) is now known, the meaning of most of that sequence remains unknown. It is very likely that the number of genes has been greatly underestimated, mainly because current gene finders only work well for large, highly expressed, evolutionarily conserved protein-coding genes. Most of the overlooked genome elements encode ribonucleic acid (RNA), of which transfer RNAs and ribosomal RNAs (rRNAs) are the classical examples. Besides these well-known molecules, there is a vast unknown world of tiny RNAs that might play a crucial role in a number of cellular processes. These elements are named noncoding RNAs (ncRNAs), and they perform their function without being translated into a protein product.

The aim of the 'NARCISUS' (Noncoding RNA comparative searching system) project was to design an integrated bioinformatics platform specifically for detecting, verifying and classifying ncRNAs. This combined approach was then used as a pipeline for detecting small nucleolar RNAs (snoRNAs) of the H/ACA type, which contain RNA motifs with low sequence conservation. The algorithm used in 'NARCISUS' significantly improved the quality of the RNA homologue search.

Several new databases were created for the project. The first was a database of experimentally confirmed human snoRNAs of the H/ACA type. It includes 44 known RNA sequences with an average length of 130 bp. The data were collected from sequences available in public databases and from the publicly available literature.

The second database created was a database of pseudouridylation sites in 18S and 28S rRNAs. The dataset contains 28 known modification sites in 18S rRNA and 182 sites in 28S rRNA.

The third database was a database of human introns. Initially, the complete human genome sequences were downloaded from the websites of the National Institutes of Health, United States of America (USA). Human DNA was divided into 23 separate sets. The human genomic sequences were then searched for repeated sequences and low-complexity sequences using the RepeatMasker programme. This way the amount of data was reduced to about 40 %. The extracted human genome sequences were compared against the Pfam protein families using BLAST to find potential new, unknown protein-coding regions. Using the database of protein families, instead of representatives of all possible sequences, reduced the number of calculations needed to identify known proteins. For this comparison, the Pfam sequences were searched against the nucleotide sequences using the TBLASTN programme of the BLAST package. The comparison also helped to identify the location of introns in the newly created database. The resulting database contained 8350 human intron sequences.
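As a rough illustration of the data-reduction step, the share of sequence that survives masking can be measured directly on a masked sequence. The sketch below assumes RepeatMasker's hard-masking output, in which masked bases are replaced by N (or by X when the -x option is used); the function name and the toy sequence are hypothetical and do not come from the project.

```python
def unmasked_fraction(masked_seq: str, mask_chars: str = "NX") -> float:
    """Return the fraction of bases that were not replaced by a mask character.

    Assumption: hard-masked input, where masked bases appear as N or X.
    Genuine N bases (e.g. assembly gaps) are counted as masked as well.
    """
    if not masked_seq:
        return 0.0
    seq = masked_seq.upper()
    masked = sum(seq.count(c) for c in mask_chars)
    return (len(seq) - masked) / len(seq)


# Hypothetical toy sequence: 20 bases, 8 of them masked.
toy = "ACGTNNNNNNACGTACGTNN"
print(f"{unmasked_fraction(toy):.2f}")
```

Applied per chromosome set, a figure like this would show how much sequence remains for the downstream comparison against Pfam.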

To detect noncoding snoRNAs, a new search algorithm was introduced. In the first step, sequences are tested for the presence of the conserved H and ACA elements. The presence of a conserved secondary structure is then checked with RNAMotif. In the next stage, a pseudo-energy test is performed to check the strength of base pairing in the inner loops and the ability to interact with other elements of the secondary structure. The last step searches the pseudouridylation database for possible antisense pseudouridylation targets. The secondary structure motif used to search for snoRNAs was created by analysing the topology of the experimentally identified snoRNAs from the previously created database. The mFOLD programme was used for the secondary structure analysis.
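The first step of the algorithm, the test for the H and ACA elements, can be sketched as a simple sequence filter. This is a minimal sketch under stated assumptions, not the project's implementation: it assumes the commonly cited H box consensus ANANNA and an ACA box located three nucleotides before the 3' end, and the `min_stem` parameter (room reserved for a hairpin on each side of the H box) is hypothetical. The later steps (RNAMotif structure check, pseudo-energy test, target lookup) are not reproduced here.

```python
import re

# Assumed H box consensus ANANNA, written in the DNA alphabet.
H_BOX = re.compile(r"A[ACGT]A[ACGT]{2}A")


def has_h_and_aca(seq: str, min_stem: int = 20) -> bool:
    """Return True if seq carries an ACA box near the 3' end and an interior H box.

    Assumptions (not taken from the report): the ACA box sits exactly three
    nucleotides before the 3' end, and at least min_stem nucleotides flank
    the H box on each side to leave room for the two hairpins.
    """
    s = seq.upper().replace("U", "T")
    if len(s) < 6 or s[-6:-3] != "ACA":
        return False
    # Interior region: skip the 5' hairpin, the 3' hairpin and the ACA tail.
    end = max(len(s) - 6 - min_stem, 0)
    interior = s[min_stem:end]
    return H_BOX.search(interior) is not None


# Hypothetical candidate: 20 nt, H box AGAGGA, 20 nt, ACA box, 3 nt tail.
candidate = "G" * 20 + "AGAGGA" + "C" * 20 + "ACA" + "TTT"
print(has_h_and_aca(candidate))
```

A cheap filter of this kind would only shortlist sequences; the structural and energetic checks described above are what separate plausible snoRNAs from chance motif matches.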

Using this methodology, more than 10 000 candidate snoRNAs of the H/ACA type were detected. The cross search against known pseudouridylation sites yielded eight new candidate snoRNAs in the human genome, each located in an intron of a protein-coding sequence. Interestingly, all candidates but one are located very close to the telomeric regions and centromeres.

The known human pseudouridylation sites number 182 in 28S rRNA and 28 in 18S rRNA, and these are set against the known, experimentally identified snoRNAs. Assuming that all snoRNAs found in this study were correctly identified, between 18 and 36 snoRNAs still remain unidentified. This may also be explained by a different, not yet discovered pseudouridylation mechanism.