
Hierarchical Motif Vectors for Protein Alignment and Functional Classification

Final Report Summary - MOTIF VECTORS (Hierarchical Motif Vectors for Protein Alignment and Functional Classification)

The two principal tasks in the computational analysis of proteins from their amino acid sequences are determining which proteins are related to one another, either by function or by evolutionary development, and identifying the specific amino acid combinations, or motifs, that determine protein function. Both tasks have largely been addressed by matching amino acid sequences across different proteins. Sequence alignment algorithms compute similarity scores between the amino acid sequences of different proteins; these scores are then used to identify protein subgroups with high within-group similarity. Amino acid combinations observed consistently among the proteins of a specific functional subgroup constitute the sequence motifs of that subgroup, and their presence indicates that a protein belongs to that functional group. The primary challenge in the computational analysis of amino acid sequences is the combinatorial complexity inherent in representing them as words over an alphabet of twenty letters, with each letter corresponding to a different amino acid.

Faced with the daunting prospect of evaluating potentially millions of possible amino acid combinations for functional specificity, we introduce a numerical alternative that characterizes protein structure from amino acid sequences using techniques from multi-scale signal decomposition and statistical learning. The proposed framework is based on hierarchical motif vectors that capture the numerical variation of the local physico-chemical composition along a protein’s amino acid sequence. This makes an extensive library of vector space data processing methods available for rigorously computing the similarity of the corresponding amino acid sequence motifs, both in the alignment of amino acid sequences and in the identification of motifs specific to functional protein groups.
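The idea of a multi-scale numerical description can be illustrated with a minimal sketch. The summary does not state which physico-chemical properties or which decomposition the project used, so everything below is an assumption made for illustration: a single property (the Kyte-Doolittle hydropathy scale), plain windowed averages as a stand-in for the multi-scale decomposition, and the hypothetical function name `motif_vectors` with window sizes 3, 7 and 15.

```python
# Kyte-Doolittle hydropathy values, used here as one example property.
# Assumption: the project's actual property set is not given in this summary.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def motif_vectors(sequence, windows=(3, 7, 15)):
    """Return one vector per residue; each component is the mean
    hydropathy in a window of a different size centred on that residue.

    Windows are truncated at the sequence ends, so every residue
    still receives a vector of the same length.
    """
    values = [HYDROPATHY[residue] for residue in sequence]
    n = len(values)
    vectors = []
    for i in range(n):
        components = []
        for width in windows:
            half = width // 2
            lo, hi = max(0, i - half), min(n, i + half + 1)
            components.append(sum(values[lo:hi]) / (hi - lo))
        vectors.append(tuple(components))
    return vectors


if __name__ == "__main__":
    for residue, vec in zip("MKTAYIAK", motif_vectors("MKTAYIAK")):
        print(residue, [round(c, 2) for c in vec])
```

Each residue is thus mapped to a point in a small vector space, with short, mid and long range context encoded in separate components, which is what allows ordinary vector distances to stand in for motif similarity.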

The project first develops global and local alignment methods for sequences of motif vectors to establish correspondence between the underlying amino acid sequences. Next, it identifies hierarchical motif vectors that possess functional or structural specificity in select protein groups via quasi-supervised statistical learning. Finally, it formulates a protein function recognition strategy based on group-specific hierarchical motif vectors.
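To make the first step concrete, the sketch below aligns two sequences of vectors globally using standard Needleman-Wunsch dynamic programming, with the substitution score replaced by a distance-based similarity. This is not the project's algorithm, which the summary does not describe; the function name `align_global`, the scoring rule (a fixed bonus minus the Euclidean distance) and the linear gap penalty are all assumptions chosen for illustration.

```python
import math


def align_global(a, b, gap=-1.0, bonus=1.0):
    """Globally align two sequences of equal-dimension vectors.

    Pairing a[i] with b[j] scores `bonus - euclidean_distance`, so
    similar vectors are rewarded and dissimilar ones penalised.
    Returns the total score and the list of aligned index pairs.
    """
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the dynamic programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = bonus - math.dist(a[i - 1], b[j - 1])
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # a[i-1] aligned to a gap
                score[i][j - 1] + gap,    # b[j-1] aligned to a gap
            )

    # Trace back from the bottom-right corner to recover the pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = bonus - math.dist(a[i - 1], b[j - 1])
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


if __name__ == "__main__":
    x = [(0.0,), (1.0,), (2.0,)]
    y = [(0.0,), (2.0,)]
    print(align_global(x, y))
```

A local variant would follow the Smith-Waterman pattern instead: clamp each cell at zero and start the traceback from the highest-scoring cell.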

The experimental results on local as well as global motif vector alignment indicate that the motif vectors adequately characterize the physico-chemical composition along amino acid sequences and allow associating segments that share similar amino acid configurations in short, mid and long range neighborhoods along their respective sequences. This makes it possible to associate amino acid sequence segments that share similar functions because their amino acid configurations have similar physico-chemical properties.

Results on the prediction of N-glycosylation at consensus sequence sites also confirm that the hierarchical motif vectors accurately characterize the physico-chemical configurations at and around amino acid sites of functional significance. Furthermore, the quasi-supervised learning strategy can sort through the prospective sites of activity and identify the ones with real functional potential based on their respective motif vectors. The quasi-supervised learning strategy is especially well suited to biomedical information processing tasks in which a relatively small collection of instances with experimentally verified attributes is available against the backdrop of a very large number of unknown prospects. The quasi-supervised learning algorithm successfully separates the probable prospects from the unlikely ones automatically, with no user intervention.
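The "prospective sites" in the N-glycosylation task can be enumerated directly from the sequence. The sketch below lists every position matching the commonly cited N-glycosylation consensus N-X-S/T, where X is any residue except proline; it covers only this candidate-listing step, not the quasi-supervised learning that ranks the candidates. The function name `candidate_sites` is hypothetical, and the summary does not say which exact consensus definition the project used.

```python
import re

# N, then any residue except P, then S or T. The lookahead lets
# overlapping candidates (e.g. "NNSS") all be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def candidate_sites(sequence):
    """Return 0-based positions of asparagines at consensus sites."""
    return [match.start() for match in SEQUON.finditer(sequence)]


if __name__ == "__main__":
    print(candidate_sites("ANGTNPSNNT"))
```

Only some of the positions returned this way are glycosylated in practice, which is why a further step is needed to separate the probable sites from the unlikely ones.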