
Discovering functional protein-RNA interactions through data integration and machine learning.

Periodic Reporting for period 1 - DeepRNA (Discovering functional protein-RNA interactions through data integration and machine learning.)

Reporting period: 2018-03-01 to 2020-02-29

RNA-binding proteins are implicated across a wide spectrum of human genetic disorders, with molecular mechanisms ranging from aggregation of proteins and RNAs to defects in splicing, localisation and translation. Examples include heterogeneous and life-threatening genetic disorders such as Diamond-Blackfan anaemia, retinitis pigmentosa, spinocerebellar ataxia, and amyotrophic lateral sclerosis (ALS) among others.

The DeepRNA project targeted genetic disease predisposition via disease-associated variants in the human transcriptome and was enabled by recent data on expression quantitative trait loci (eQTLs) and experimentally determined RNA-protein and RNA-RNA interactions. These data were complemented with high-quality protein-RNA interaction predictions carried out in the host group, which had a strong track record in computing and validating ribonucleoprotein associations. The project has expanded the human protein–RNA interactome in a genome-wide manner beyond experimental data, which is available for only 352 of the 1,542 recently described RNA-binding proteins. The information in this extended interactome should contribute to progress towards precision medicine.

The deliverables of DeepRNA were designed to be of direct use in the clinical assessment of potentially pathogenic genomic variants. It is my hope that this will directly help to improve the diagnostic performance and value of clinical analysis products, as well as delivering wider inspiration for artificial intelligence applications in biology and RNA-protein interaction network research, increasing the competitiveness of European research and innovation in these fields. A short international secondment at a leading and highly innovative genomic machine learning group, the Kundaje lab at Stanford University (USA), also gave me an opportunity to disseminate the results of my project and to acquire detailed knowledge of advanced machine learning methods applicable to genomic data.

The key deliverable of the project, RNAct, a functionally annotated comprehensive reconstruction of the human RNA-protein interaction network rooted in the authoritative Ensembl and UniProt resources, is intended to be of long-term usefulness to researchers across sectors and disciplines and to complement the already excellent profile of European public resources in genomics. It has been published in Nucleic Acids Research and is accessible at https://rnact.crg.eu. It is now fully integrated as an external database in UniProt, the authoritative protein information database.
The DeepRNA project modelled the human RNA-protein interactome using experimental data and predictions.

A pilot RNA-protein interaction network was generated from current ENCODE eCLIP protein–RNA interaction data for 119 proteins. Additional RBP-RNA interactions covering the entire human proteome were then predicted using the “catRAPID” method, as published by the host group.
To prioritise interactions of interest using human coding and non-coding variants that affect the network, the pilot RNA-protein interaction network was then enhanced by overlaying information on trans- and cis-eQTLs (expression quantitative trait loci).
The pilot RNA-protein interaction network was then further enhanced by integrating human disease-associated and natural variation data to test the robustness and disease relevance of specific interactions using prediction methods.
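
The integration steps above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the `Interaction` record, the `merge_interactions` and `prioritise` helpers, the scores and all identifiers are hypothetical placeholders. It only shows the general idea of combining experimental and predicted interactions, preferring experimental evidence where both exist, and then ranking interactions whose RNA is an eQTL target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    protein: str   # protein identifier (placeholder values below)
    rna: str       # RNA identifier (placeholder values below)
    source: str    # "experimental" or "predicted"
    score: float   # hypothetical confidence score in [0, 1]


def merge_interactions(experimental, predicted):
    """Combine both sources; experimental evidence overrides a prediction."""
    merged = {(i.protein, i.rna): i for i in predicted}
    merged.update({(i.protein, i.rna): i for i in experimental})
    return merged


def prioritise(merged, eqtl_rnas):
    """Keep interactions whose RNA is an eQTL target, highest score first."""
    hits = [i for i in merged.values() if i.rna in eqtl_rnas]
    return sorted(hits, key=lambda i: i.score, reverse=True)


# Toy data, assumed purely for illustration.
experimental = [Interaction("P1", "R1", "experimental", 0.9)]
predicted = [
    Interaction("P1", "R1", "predicted", 0.4),
    Interaction("P2", "R2", "predicted", 0.7),
    Interaction("P3", "R3", "predicted", 0.8),
]
eqtl_rnas = {"R1", "R2"}  # RNAs assumed to be affected by an eQTL

network = merge_interactions(experimental, predicted)
ranked = prioritise(network, eqtl_rnas)
```

In this toy example the P3 and R3 pair is dropped from the ranked list because R3 is not an eQTL target, even though its predicted score is high. A real analysis would need a more careful scoring scheme and variant-level annotation.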

A final full-coverage RNA-protein interaction network, integrating additional experimental data and systematic prediction method refinements, was completed and a database interface web server was developed (https://rnact.crg.eu). This website is intended to be the first easily accessible resource for high-quality human RNA-protein interaction data.

Additionally, the project intended to apply machine learning to identify novel interactions of potential medical relevance, yielding a prioritised list of likely disease-relevant protein-RNA interactions to be followed up within the host group by experimental validation. A short secondment at the Kundaje lab, a genomic machine learning group at Stanford University, allowed me to initiate a collaboration aiming to develop a deep neural network classifier trained to identify potential disease-relevant variants within the human RNA-protein interactome. However, owing to the immense technical challenge involved, this collaboration is still in its starting phase.

Two articles were published relating to the project, both in separate Nucleic Acids Research special database issues. The first describes RNAct, for which a database interface web server was developed (https://rnact.crg.eu). This web server now provides easy access to human and mouse protein–RNA interaction data generated by the ENCODE Project, the largest and most consistent such effort to date. It is aimed at experts and non-experts alike. RNAct is also now linked to from the authoritative UniProt protein database as a cross-referenced resource, which greatly increases its reach and visibility. The second describes BacFITBase, a database collating information on the essentiality of bacterial genes during host infection in various vertebrate species, accessible at http://www.tartaglialab.com/bacfitbase.
The RNAct database web server (https://rnact.crg.eu) provides the first proteome- and transcriptome-wide view of the human, mouse and yeast protein–RNA interactomes, enabled by novel genome-wide protein–RNA interaction predictions. It is the only such resource with proteome- and transcriptome-wide coverage. It also provides the first easy non-bulk access to eCLIP protein–RNA interaction data from the ENCODE Project, to experts and non-experts alike. Its intelligent search term processing allows real-world use by biologists of all specialisations. Its recent integration into UniProt, the authoritative protein information database, as a cross-referenced resource greatly increases its reach and impact.

Likewise, the BacFITBase database (http://www.tartaglialab.com/bacfitbase) enables easy access to data on the essentiality of bacterial pathogens' genes during infection, and is an effort to make data from multiple individual publications available to experts and non-experts alike. It currently integrates data from 15 transposon mutagenesis studies covering 15 pathogenic bacteria and 5 host vertebrates across 10 different tissues. In light of the ongoing COVID-19 pandemic, I believe that the importance of being able to intelligently target treatment efforts at pathogens is clearly apparent.