Final Report Summary - INTOACT (Assessment of the regulatory mechanisms of different protein post-translational modifications by studying their role within a realistic in silico model of the cell.)
The InToAct project proposed to study protein post-translational regulation from a systems biology perspective. Our major goal was to characterize the action of protein post-translational modifications (PTMs) within a molecular network of functional associations, in order to learn patterns that would help decipher the so-called PTM code. We set three intermediate objectives: i) the generation of a highly annotated interactome including different types of cellular elements and relationships; ii) the characterization of the network features of the PTM types and the identification of recurrent patterns involving specific post-translational regulatory events; and iii) the analysis of the conservation of the PTM residues across species to calibrate the annotation transfer to non-model organisms.
The InToAct project, scheduled from 01/09/2011 to 31/08/2013, followed a set of sequential steps to achieve the intermediate objectives proposed. The starting point was the collection of the available regulatory protein information: experimentally validated PTMs from databases, high-throughput experiments and the scientific literature, together with other types of regulatory annotation from public databases. For the PTM collection I gathered more than 420,000 experimentally verified PTMs and applied a preprocessing pipeline to remove sequence redundancy. I also developed a set of Natural Language Processing rules for the extraction of PTMs from the scientific literature that is in use in the STRING database. The first dataset under study was composed of more than 115,000 modified residues of 13 PTM types from 8 eukaryotic species. To the best of our knowledge, this was the largest PTM dataset ever analyzed jointly. We also collected nearly 900,000 protein-protein interactions and almost 4 million protein-protein functional associations from public databases to build the molecular network representing the global scenario in which the PTM regulation was studied. In order to annotate this large molecular network we developed, as scheduled, an ontology (a standard vocabulary) of actions describing the major functional associations that may occur within the cell. This ontology is also in use in the STRING database (Franceschini et al., 2013), probably the most cited and used public database of protein physical and functional interactions.
After data collection and the generation of the scheduled functional annotation tools, we developed the methodologies required to study this type of protein regulation within several species and over evolution. During the first year we designed a novel algorithm to measure protein residue conservation, the Residue Conservation Score (RCS), which we used to measure the speed of evolution of different PTM types. We analyzed the evolutionary conservation of 13 different PTM types in several eukaryotes, using the conservation of the modified amino acid as a proxy for the conservation of the function (Gnad et al., 2007; Holt et al., 2009; Tan & Bader, 2012), and determined that the speed of evolution differs among PTM types: carboxylation clearly stands as the most conserved, while SUMOylation is the fastest-evolving PTM (Minguez et al., 2012). Our work on the speed of evolution of PTM types represented the first high-throughput approach to compare several different PTM types within an accurate statistical framework.
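The idea of using the conservation of the modified amino acid as a proxy can be illustrated with a minimal sketch. It is not the published RCS: it only computes the fraction of orthologs that keep the modified residue at the aligned position. The function names, the dictionary-based alignment format and the gap symbol are assumptions made for this illustration.

```python
def conserved_fraction(column, modified_residue, gap="-"):
    """Fraction of non-gap orthologous residues identical to the modified one."""
    observed = [aa for aa in column if aa != gap]
    if not observed:
        return 0.0
    return sum(aa == modified_residue for aa in observed) / len(observed)


def site_conservation(alignment, ref_id, ref_position, gap="-"):
    """Naive conservation of one modified site (illustrative, not the RCS).

    alignment    -- dict mapping a species name to its aligned sequence
                    (all sequences have the same length; hypothetical format)
    ref_id       -- species in which the PTM was observed
    ref_position -- 1-based position of the PTM in the ungapped reference
    """
    ref = alignment[ref_id]
    # Map the ungapped reference position to an alignment column.
    seen = 0
    for col, aa in enumerate(ref):
        if aa != gap:
            seen += 1
            if seen == ref_position:
                break
    else:
        raise ValueError("position lies outside the reference sequence")
    # Compare the reference residue with the same column in every ortholog.
    column = [seq[col] for species, seq in alignment.items() if species != ref_id]
    return conserved_fraction(column, ref[col], gap)
```

A real score would additionally have to account for the evolutionary distance between species and for the background conservation of the protein, which this sketch ignores.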
The main objective of the InToAct project was the characterization of the PTM types within the molecular network in order to extract patterns that would link the PTM types to particular functions. We first studied the outcome of the combinatorial action of the PTMs present in the same protein. Here we proposed the co-evolution of the modified residues across species as a proxy for their functional association. This co-evolution analysis determined that certain combinations of PTM types co-evolve more than expected by chance, with no correlation with the co-occurrence of the PTM types within the proteins. Thus we could extract functionally associated pairs of PTM types by generalizing the post-translational regulation predictions made for individual proteins. Searching for common features of the proteins having the same type of predicted regulation, we found that proteins sharing pairs of associated PTM types were enriched in specific subcellular localizations and functions, as well as in protein-protein interaction networks and in other regulatory elements such as protein short linear motifs and globular domains, which supports their regulatory role (Minguez et al., 2012). As co-evolution cannot capture all the possible mechanisms of functional association, we broadened our scope for the detection of functionally associated PTMs by developing further methods based on their proximity in the protein structure, their competition for the same site and the extraction of protein regions (hotspots) with high PTM density (Minguez et al., 2013). In order to share our findings with the scientific community we developed a public web-based database called PTMcode (http://ptmcode.embl.de/), where researchers worldwide interested in particular proteins or groups of proteins can search for our predictions of their post-translational regulation.
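The extraction of hotspots with high PTM density mentioned above can be pictured with a simple sliding-window sketch. This is an assumed, simplified procedure and not necessarily the one implemented in PTMcode; the function name, the window width and the minimum number of sites are hypothetical parameters chosen for illustration.

```python
def ptm_hotspots(positions, window=10, min_sites=3):
    """Return merged (start, end) regions of high PTM density.

    A region is reported when at least `min_sites` distinct modified
    positions fall within `window` consecutive residues; overlapping
    regions are merged. Positions are 1-based residue indices.
    """
    sites = sorted(set(positions))
    regions = []
    left = 0
    for right, pos in enumerate(sites):
        # Advance the left edge until the span covers at most `window` residues.
        while pos - sites[left] + 1 > window:
            left += 1
        if right - left + 1 >= min_sites:
            start, end = sites[left], pos
            if regions and start <= regions[-1][1]:
                # Overlaps the previous hotspot: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

For example, `ptm_hotspots([5, 8, 12, 40, 100, 103, 105, 109])` groups the sites into two dense regions and leaves the isolated site at position 40 out.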
Looking at the big picture represented by the PPI network (interactome), we first showed that proteins with the same pattern of PTM regulation were more connected among themselves than expected by chance (Minguez et al., 2012). In a subsequent analysis we found that proteins with a higher number of PTMs and more PTM types also have significantly higher connectivity and are more central within the human, mouse and yeast interactomes. Within this work we also found particular protein clusters, defined by their interactions, that are enriched in particular combinations of PTM types. Those clusters were compared across species and characterized according to their functionality, which gave us some overall patterns on the evolution of the combinations of PTM types and their function (manuscript in preparation).
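A statement such as "more connected among themselves than expected by chance" is usually backed by a randomization test. The sketch below shows one simple way to do this, assuming uniform random sampling of protein sets of the same size; it is an illustration rather than the statistical framework used in the project, and a more careful analysis would typically use degree-preserving randomization. All function and parameter names are hypothetical.

```python
import random


def internal_edges(edges, members):
    """Count interactions whose two partners both belong to `members`."""
    members = set(members)
    return sum(1 for a, b in edges if a != b and a in members and b in members)


def connectivity_pvalue(edges, group, n_perm=1000, seed=0):
    """Empirical p-value for the number of interactions inside `group`.

    edges -- list of (protein_a, protein_b) pairs, each undirected
             interaction listed once (assumed input format)
    group -- proteins sharing, for instance, the same PTM pattern
    Random protein sets of the same size are drawn uniformly from the
    network nodes, which ignores node degree (a simplifying assumption).
    """
    nodes = sorted({node for edge in edges for node in edge})
    in_network = set(nodes)
    size = len({p for p in group if p in in_network})
    observed = internal_edges(edges, group)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(nodes, size)
        if internal_edges(edges, sample) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

The function returns the observed number of internal interactions together with the fraction of random sets that are at least as densely connected.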
In a second round of data collection we doubled the number of PTMs, covering a total of 20 eukaryotes, and increased the number of PTM types selected to 69. Using this augmented dataset we applied the set of algorithms developed to predict functionally associated PTMs in order to explore the combinatorial patterns behind the regulation of both physical and functional protein-protein interactions. We have found significantly co-evolving pairs of PTM types that are candidates to regulate particular protein-protein interactions, as well as interesting combinatorial patterns linked to functionality that are under further investigation. We have also set up a protocol to study the physical interaction of PTMs within protein interaction interfaces. The predictions from both analyses are going to be incorporated into the second release of the PTMcode database.
Using our evolutionary analysis framework we propagated the PTM annotation obtained from current high-throughput experiments to proteins in other species, using the conservation of the modified residues as a prediction method. In this way we obtained close to 3,000,000 candidate PTMs in 20 eukaryotes. This is important because the available high-throughput experiments still cover few PTM types in few organisms. The predictions will also be available in the second release of the PTMcode database.
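The propagation step can be sketched as follows, under the simplifying assumption that a PTM is transferred whenever the modified residue is identical at the aligned position of an ortholog. The actual framework relies on the conservation analysis described earlier and is likely to be more stringent; the function name and the pairwise-alignment input format are hypothetical.

```python
def transfer_ptms(aligned_source, aligned_target, source_sites, gap="-"):
    """Propose candidate PTM sites in an ortholog (illustrative only).

    aligned_source, aligned_target -- two rows of a pairwise alignment
    source_sites -- 1-based positions of known PTMs in the ungapped source
    Returns a dict mapping each transferred source position to the
    1-based position of the identical residue in the ungapped target.
    """
    if len(aligned_source) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    wanted = set(source_sites)
    src_pos = tgt_pos = 0
    transferred = {}
    for s, t in zip(aligned_source, aligned_target):
        # Track ungapped coordinates in both sequences.
        if s != gap:
            src_pos += 1
        if t != gap:
            tgt_pos += 1
        # Transfer only when both residues exist and are identical.
        if s != gap and t != gap and s == t and src_pos in wanted:
            transferred[src_pos] = tgt_pos
    return transferred
```

Sites whose aligned residue differs or falls in a gap are simply not transferred, so the result contains only candidates supported by residue identity.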
To sum up, the InToAct project reached the scientific objectives proposed. We have studied post-translational protein regulation from a systems biology point of view, which represents the first contribution to the characterization of the function of PTM types and their combinations at proteome scale and over evolution. Our paper (Minguez et al., 2012) was highlighted in a “News and Views” article (Creixell & Linding, 2012), which shows the innovative aspect of the work. The PTMcode database (Minguez et al., 2013) represents an excellent platform to communicate and share our results with the scientific community. The corresponding paper was listed within the top 5% of papers in terms of originality, significance and scientific excellence published in its issue of the NAR journal, which gives the InToAct project special visibility.
The InToAct project has also helped the granted fellow, Pablo Minguez, to achieve his career-related goals. From the strictly scientific point of view, he has substantially increased the number and quality of his scientific publications. It is worth highlighting the two first-author publications already published (Minguez et al., 2012, 2013) in two high-impact journals (impact factors of 11.3 and 8.278, respectively), a collaboration within the STRING consortium (Franceschini et al., 2013) and two more papers from different collaborations that were a direct result of the techniques learnt by the fellow within the project scope (Chen, Minguez, Lercher, & Bork, 2012; Doerks, van Noort, Minguez, & Bork, 2012). The fellow has also established a number of ongoing and future collaborations that contribute to the excellence of his current research and expand his scientific network for future collaborations once he becomes an independent researcher.
References
Chen, W.-H., Minguez, P., Lercher, M. J., & Bork, P. (2012). OGEE: an online gene essentiality database. Nucleic acids research, 40(Database issue), D901–6. doi:10.1093/nar/gkr986
Creixell, P., & Linding, R. (2012). Cells, shared memory and breaking the PTM code. Molecular systems biology, 8, 598. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3421448&tool=pmcentrez&rendertype=abstract
Doerks, T., van Noort, V., Minguez, P., & Bork, P. (2012). Annotation of the M. tuberculosis hypothetical orfeome: adding functional information to more than half of the uncharacterized proteins. PloS one, 7(4), e34302. doi:10.1371/journal.pone.0034302
Franceschini, A., Szklarczyk, D., Frankild, S., Kuhn, M., Simonovic, M., Roth, A., … Jensen, L. J. (2013). STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic acids research, 41(Database issue), D808–15. doi:10.1093/nar/gks1094
Gnad, F., Ren, S., Cox, J., Olsen, J. V, Macek, B., Oroshi, M., & Mann, M. (2007). PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome biology, 8(11), R250. doi:10.1186/gb-2007-8-11-r250
Holt, L. J., Tuch, B. B., Villén, J., Johnson, A. D., Gygi, S. P., & Morgan, D. O. (2009). Global analysis of Cdk1 substrate phosphorylation sites provides insights into evolution. Science (New York, N.Y.), 325(5948), 1682–6. doi:10.1126/science.1172867
Minguez, P., Letunic, I., Parca, L., & Bork, P. (2013). PTMcode: a database of known and predicted functional associations between post-translational modifications in proteins. Nucleic acids research, 41(Database issue), D306–11. doi:10.1093/nar/gks1230
Minguez, P., Parca, L., Diella, F., Mende, D., Kumar, R., Helmer-Citterich, M., … Bork, P. (2012). Deciphering a global network of functionally associated post-translational modifications. Molecular Systems Biology, 8. doi:10.1038/msb.2012.31
Tan, C. S. H., & Bader, G. D. (2012). Phosphorylation sites of higher stoichiometry are more conserved. Nature Methods, 9(4), 317–317. doi:10.1038/nmeth.1941