Searching Protein Structure Space

Final Report Summary - SPSS (Searching Protein Structure Space)

The main objective of the project was to design a novel large-scale search system for proteins in the Protein Data Bank (PDB) that answers structural queries much faster than existing systems. The proposed system was to be based on an indexable string description of protein structure, so that it could employ an inverted index (a data structure that supports fast retrieval in such large datasets).
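As a rough illustration of why an indexable description matters, the sketch below builds a toy inverted index. It assumes each structure has already been reduced to a list of string tokens; the structure ids, the token names (`f3`, `f7`, ...) and both function names are hypothetical and are not taken from the project.

```python
from collections import defaultdict


def build_inverted_index(descriptions):
    """Map each token to the set of structure ids whose description contains it."""
    index = defaultdict(set)
    for struct_id, tokens in descriptions.items():
        for token in tokens:
            index[token].add(struct_id)
    return index


def shared_token_counts(index, query_tokens):
    """For each structure, count how many distinct query tokens it contains.

    Only the posting lists of the query tokens are visited, so structures
    that share nothing with the query are never touched.
    """
    hits = defaultdict(int)
    for token in set(query_tokens):
        for struct_id in index.get(token, ()):
            hits[struct_id] += 1
    return dict(hits)


# Hypothetical toy data: three structures described by fragment tokens.
descriptions = {
    "1abc": ["f3", "f7", "f7", "f12"],
    "2xyz": ["f1", "f3"],
    "3pqr": ["f9"],
}
index = build_inverted_index(descriptions)
shared = shared_token_counts(index, ["f3", "f7"])
```

The lookup cost depends on the length of the posting lists for the query tokens rather than on the size of the whole database, which is the property a fast search engine relies on.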
We proposed a system that belongs to the ‘filter-and-refine’ paradigm, which has two steps. First, the filter method quickly sifts through a large set of structures and selects a small candidate set. Then, in the refine stage, the best candidates are identified using existing, accurate (yet computationally expensive) structural alignment heuristics. Since the refine stage uses existing tools on relatively small sets, the challenge lies in designing accurate filter methods, i.e. ones that can identify good candidates for structural similarity. To speed up retrieval, the filter method should ideally be indexable.
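The two steps can be sketched generically as follows. This is a minimal illustration, not the project's implementation: `cheap_score` and `expensive_score` are placeholder callables standing in for a filter method and a structural aligner, and the toy "structures" are just pairs of numbers.

```python
def filter_and_refine(query, database, cheap_score, expensive_score, k):
    """Rank the whole database with a cheap score, then re-rank only the top k."""
    # Filter: one cheap comparison per database entry.
    ranked = sorted(database, key=lambda s: cheap_score(query, s), reverse=True)
    shortlist = ranked[:k]
    # Refine: the expensive comparison runs on the shortlist only.
    return sorted(shortlist, key=lambda s: expensive_score(query, s), reverse=True)


# Hypothetical toy setup: the cheap score looks at the first coordinate only,
# the expensive score uses the full squared distance (higher means more similar).
expensive_calls = []


def cheap(q, s):
    return -abs(q[0] - s[0])


def expensive(q, s):
    expensive_calls.append(s)
    return -sum((a - b) ** 2 for a, b in zip(q, s))


database = [(0, 0), (1, 5), (1, 1), (9, 9), (2, 0)]
query = (1, 0)
best = filter_and_refine(query, database, cheap, expensive, k=3)
```

In this toy run the expensive score is evaluated three times instead of five. The shortlist size `k` plays the role of the candidate-set threshold, and a good filter is one whose shortlist still contains the entries the expensive method would have ranked highest.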
During the reporting period, we designed and extensively validated a novel filter method based on protein backbone fragments. Our method, denoted FragBag, uses a succinct representation of proteins as ‘bags-of-fragments’. We describe a protein structure by the collection of its overlapping short contiguous backbone segments and discretize this set using a library of fragments. We then represent the protein by a vector that counts the number of occurrences of each library fragment. We approximate the similarity between two protein structures by the similarity of their FragBag vectors. This representation was designed so that it can be used in an inverted index, for implementing a fast structural search engine over the complete PDB. It has an additional benefit, which was not mentioned in the original proposal: one can specify a structure as a collection of sub-structures, without combining them into a single structure; this is valuable for structure prediction, when reliable predictions exist only for parts of the protein.
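The bag-of-fragments construction can be sketched as below, under several loud assumptions. A backbone is taken to be a list of 3D points, the fragment library is a hand-made toy, and fragments are matched to library entries by comparing their internal pairwise distances. That matching rule is a simplification chosen here to keep the example short; it is not claimed to be the comparison FragBag uses. Cosine similarity is likewise just one possible measure between the count vectors.

```python
import math
from itertools import combinations


def fragments(coords, length):
    """All overlapping contiguous segments of the given length."""
    return [coords[i:i + length] for i in range(len(coords) - length + 1)]


def signature(fragment):
    """Internal pairwise distances: unchanged by rotation and translation.

    Assumption: used here as a simple stand-in for a proper structural
    comparison of two fragments.
    """
    return [math.dist(a, b) for a, b in combinations(fragment, 2)]


def nearest_library_index(fragment, library):
    """Index of the library fragment whose signature is closest."""
    sig = signature(fragment)
    return min(range(len(library)),
               key=lambda j: math.dist(sig, signature(library[j])))


def bag_of_fragments(coords, library):
    """Count how often each library fragment is the best match along the chain."""
    counts = [0] * len(library)
    for frag in fragments(coords, len(library[0])):
        counts[nearest_library_index(frag, library)] += 1
    return counts


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical two-entry library of length-3 fragments: "straight" and "bent".
library = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
    [(0, 0, 0), (1, 0, 0), (1, 1, 0)],
]
chain_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (3, 1, 0)]
# chain_b has the same shape as chain_a, rotated in the plane.
chain_b = [(0, 0, 0), (0, 1, 0), (0, 2, 0), (0, 3, 0), (1, 3, 0)]

vec_a = bag_of_fragments(chain_a, library)
vec_b = bag_of_fragments(chain_b, library)
similarity = cosine_similarity(vec_a, vec_b)
```

Because the vector only counts fragments and ignores their order, the counts from several separate sub-structures can simply be added, which is what makes it possible to query with a collection of partial structures.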
A significant challenge, which we addressed in this project, was properly validating our filter method. Validation was important for two reasons: (1) tuning the parameters of the system, and (2) comparing its performance to other state-of-the-art filter methods. Filter methods order the structures in the database by their similarity to the query, and different methods produce different orderings. Different thresholds can then be applied: we can consider the 100 most similar structures, or only the 10 most similar, depending on the computational resources available for the refine stage. For validation, one also needs to determine, for each structure in the database, whether it is truly structurally similar to the query.
We used ROC curve analysis to quantify the success of FragBag in identifying near-neighbor candidate sets in a dataset of over 2,900 protein structures. The gold standard is the set of near structural neighbors found by six state-of-the-art structural aligners. This allowed us to identify proper parameters for the system: the fragment length and the similarity measure between two FragBag vectors. It also allowed us to consider different threshold values for deciding whether two structures are indeed similar (depending on how strict the search engine should be).
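One common way to summarize a ROC curve is the area under it (AUC), which equals the probability that a randomly chosen true neighbor is ranked above a randomly chosen non-neighbor. The sketch below computes it directly from that definition. The scores and labels are made-up toy values, and the report does not state that AUC was the specific statistic used, so this is only an illustration of the kind of analysis described.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    scores: filter similarity of each database structure to the query.
    labels: True if the gold standard marks that structure as a near neighbor.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical toy query: five database structures, three true neighbors.
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [True, True, False, True, False]
auc = roc_auc(scores, labels)
```

A value of 1.0 means every true neighbor is ranked above every non-neighbor, and 0.5 corresponds to a random ordering. Repeating the computation for different fragment lengths or similarity measures gives a direct way to compare parameter choices against the same gold standard.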
Further, in a comparison with other state-of-the-art filter methods (SGM, PRIDE, and a method of Zotenko et al.), we show that our best FragBag library finds more accurate candidate sets. We also used the computationally expensive, yet highly trusted, structural aligners STRUCTAL and CE as filter methods (i.e. we used the results of their structural alignments to sort the structures in the database from the most to the least structurally similar to the query). Surprisingly, FragBag performs similarly to these two aligners, even though it is orders of magnitude faster.
An additional benefit of the FragBag representation is that it describes protein structures as fixed-size vectors, so they can be embedded as points in a high-dimensional space and this space can be projected to lower dimensions. We used this representation to create three-dimensional maps of structure space from a very large dataset of more than 30,000 Structural Classification of Proteins (SCOP) domains. In our maps, each domain is represented by a point, and the distance between any two points approximates the structural distance between the corresponding domains. We use these maps to study the spatial distributions of protein properties, in particular those of local vicinities in structure space, such as structural density and functional diversity. These maps provide a unique broad view of protein space and thus reveal previously undescribed fundamental properties of it. At the same time, the maps are consistent with previous knowledge (e.g. domains cluster by their SCOP class) and organize previous observations concerning specific protein folds into a unified, coherent representation. To investigate the function-structure relationship, we measure the functional diversity (using the Gene Ontology controlled vocabulary) in local structural vicinities. Our most striking finding is that functional diversity varies considerably across structure space: the space has a highly diverse region, and diversity abates when moving away from it. Interestingly, the domains in this region are mostly alpha/beta structures, which are known to be the most ancient proteins. We believe that our unique perspective of structure space will open previously undescribed ways of studying proteins, their evolution, and the relationship between their structure and function.