
STATISTICAL ANALYSIS OF PROTEIN SEQUENCES TO INFER 3D STRUCTURE AND FUNCTION

Final Report Summary - EVO-COUPLINGS (STATISTICAL ANALYSIS OF PROTEIN SEQUENCES TO INFER 3D STRUCTURE AND FUNCTION)

Marie Curie CIG reporting document

Summary
Overall, the sub-parts of Aims 1 and 2 have been addressed for protein sequence alignments and, beyond the work envisaged in the grant proposal, in two further contexts: protein-protein interactions and small-molecule ligand binding to protein receptors. In the latter context, the aim is to use covariance analysis to build a model of protein-ligand binding. Analogously to protein tertiary structure prediction, this model is used to predict whether a given ligand is likely to bind to a protein receptor of interest.

Major Results Summary
Bitbol, A. F., Dwyer, R. S., Colwell*, L. J., & Wingreen*, N. S. (2016). Inferring interaction partners from protein sequences. Proceedings of the National Academy of Sciences, 113(43), 12180-12185.

Specific protein−protein interactions are crucial in the cell, both to ensure the formation and stability of multiprotein complexes and to enable signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interaction partners, causing their sequences to be correlated. Here we exploit these correlations to accurately identify, from sequence data alone, which proteins are specific interaction partners. Our general approach uses a pairwise maximum entropy model to infer couplings between residues. We introduce an iterative algorithm to predict specific interaction partners from two protein families whose members are known to interact. We obtain a striking 0.93 true positive fraction on our complete dataset (bacterial two-component signaling systems and ABC transporter complexes) without any a priori knowledge of interaction partners.
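As a rough illustration of how residue couplings can score a candidate pair of partners, the sketch below fits couplings with a mean-field approximation (coupling matrix taken as the negative inverse of a regularized covariance matrix) and sums the inter-protein couplings for a candidate pair. The mean-field approximation, the four-letter toy alphabet, the shrinkage parameter `pseudo`, and all function names are assumptions made for this example and are not taken from the summary. The iterative partner-matching step described above is not reproduced here.

```python
import numpy as np

ALPHABET = "ACDE"  # hypothetical toy alphabet, chosen only to keep the example small

def one_hot(seqs):
    """Encode equal-length sequences as rows of concatenated one-hot vectors."""
    q = len(ALPHABET)
    index = {a: i for i, a in enumerate(ALPHABET)}
    x = np.zeros((len(seqs), len(seqs[0]) * q))
    for s, seq in enumerate(seqs):
        for pos, aa in enumerate(seq):
            x[s, pos * q + index[aa]] = 1.0
    return x

def mean_field_couplings(paired_seqs, pseudo=0.5):
    """Couplings J = -C^{-1}, with C shrunk toward a scaled identity (assumed regularizer)."""
    x = one_hot(paired_seqs)
    cov = np.cov(x, rowvar=False, bias=True)
    cov = (1.0 - pseudo) * cov + pseudo * np.eye(cov.shape[0]) / len(ALPHABET)
    return -np.linalg.inv(cov)

def interaction_score(seq_a, seq_b, couplings):
    """Sum of couplings between residues of A and residues of B (higher = more compatible)."""
    q = len(ALPHABET)
    x = one_hot([seq_a + seq_b])[0]
    na = len(seq_a) * q
    return float(x[:na] @ couplings[:na, na:] @ x[na:])

# Toy training set: the first residue of B always copies the first residue of A.
rng = np.random.default_rng(0)
pairs = []
for _ in range(500):
    i, j, k = (int(v) for v in rng.integers(0, len(ALPHABET), size=3))
    pairs.append(ALPHABET[i] + ALPHABET[j] + ALPHABET[i] + ALPHABET[k])

couplings = mean_field_couplings(pairs)
matched = interaction_score("AC", "AD", couplings)      # first residues agree
mismatched = interaction_score("AC", "CD", couplings)   # first residues differ
print(matched > mismatched)
```

In a partner-prediction setting, scores of this kind would be compared across all candidate partners within a species; how the candidates are ranked and fed back into training is left out of this sketch.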

Lee, Alpha A., Brenner, Michael P., and Lucy J. Colwell. "Predicting protein–ligand affinity with a random matrix framework." Proceedings of the National Academy of Sciences 113.48 (2016)

Rapid determination of whether a candidate compound will bind to a particular target receptor remains a stumbling block in drug discovery. We use an approach inspired by random matrix theory to decompose the known ligand set of a target in terms of orthogonal “signals” of salient chemical features, and distinguish these from the much larger set of ligand chemical features that are not relevant for binding to that particular target receptor. After removing the noise caused by finite sampling, we show that the similarity of an unknown ligand to the remaining, cleaned chemical features is a robust predictor of ligand–target affinity, performing as well as or better than any algorithm in the published literature.
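One way to read this procedure is sketched below, under stated assumptions: the correlation matrix of ligand descriptors is eigendecomposed, eigenvectors whose eigenvalues exceed the Marchenko-Pastur upper edge (1 + sqrt(p/n))^2 are kept as "signal", and a new ligand is scored by its distance from the subspace they span. The use of the Marchenko-Pastur edge as the cutoff, the distance-based score, the synthetic descriptors, and the function names are all assumptions for illustration, not details given in the summary.

```python
import numpy as np

def fit_signal_subspace(ligands):
    """ligands: (n_ligands, n_features) descriptors of known binders (hypothetical input)."""
    n, p = ligands.shape
    mean = ligands.mean(axis=0)
    std = ligands.std(axis=0)
    std[std == 0] = 1.0                      # avoid division by zero for constant features
    z = (ligands - mean) / std
    corr = z.T @ z / n                       # feature-feature correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)
    mp_edge = (1.0 + np.sqrt(p / n)) ** 2    # largest eigenvalue expected from pure noise
    signal = eigvecs[:, eigvals > mp_edge]   # keep only directions above the noise edge
    return mean, std, signal

def distance_to_subspace(x, mean, std, signal):
    """Norm of the part of x (standardized) lying outside the signal subspace."""
    z = (x - mean) / std
    residual = z - signal @ (signal.T @ z)
    return float(np.linalg.norm(residual))

# Synthetic binders: one latent chemical factor drives the first five features.
rng = np.random.default_rng(0)
n_ligands, n_features = 300, 20
loading = np.zeros(n_features)
loading[:5] = 3.0
latent = rng.normal(size=(n_ligands, 1))
ligands = latent * loading + rng.normal(size=(n_ligands, n_features))

mean, std, signal = fit_signal_subspace(ligands)
candidate = rng.normal(size=n_features)
print(signal.shape[1], distance_to_subspace(candidate, mean, std, signal))
```

A smaller distance is interpreted here as greater similarity to the cleaned feature set; any decision threshold would have to be calibrated on held-out data.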

Colwell, L., & Qin, C. (2018). Power Law Tails In Phylogenetic Systems. Proceedings of the National Academy of Sciences of the United States of America. https://doi.org/10.1073/pnas.1711913115

Covariance analysis of protein sequence alignments uses coevolving pairs of sequence positions to predict features of protein structure and function. However, current methods ignore the phylogenetic relationships between sequences, potentially corrupting the identification of covarying positions. Here, we use random matrix theory to demonstrate the existence of a power law tail that distinguishes the spectrum of covariance caused by phylogeny from that caused by structural interactions. The power law is essentially independent of the phylogenetic tree topology, depending on just two parameters: the sequence length and the average branch length. We demonstrate that these power law tails are ubiquitous in the large protein sequence alignments used to predict contacts in 3D structure, as predicted by our theory. This suggests that to decouple phylogenetic effects from the interactions between sequence-distal sites that control biological function, it is necessary to remove or downweight the eigenvectors of the covariance matrix with the largest eigenvalues. We confirm that truncating these eigenvectors improves contact prediction.
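The truncation step can be sketched as follows, assuming a symmetric covariance matrix and a user-chosen number `k` of leading modes to discard. The function name, the real-valued toy data standing in for an encoded alignment, and the choice `k=1` are illustrative assumptions; the summary does not state how many eigenvectors were removed.

```python
import numpy as np

def truncate_top_modes(cov, k):
    """Return cov with its k largest-eigenvalue modes set to zero."""
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    kept = eigvals.copy()
    if k > 0:
        kept[-k:] = 0.0
    # Rebuild the matrix from the remaining modes.
    return (eigvecs * kept) @ eigvecs.T

# Toy data: a factor shared by all sites (a crude stand-in for phylogenetic
# signal) plus independent noise at each site.
rng = np.random.default_rng(0)
n_seqs, n_sites = 200, 12
shared = rng.normal(size=(n_seqs, 1))
data = 2.0 * shared + rng.normal(size=(n_seqs, n_sites))

cov = np.cov(data, rowvar=False)
cleaned = truncate_top_modes(cov, k=1)
print(np.linalg.eigvalsh(cov)[-1], np.linalg.eigvalsh(cleaned)[-1])
```

Downweighting instead of removing would replace the zeroing step with a multiplicative factor below one on the same eigenvalues.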

Colwell, Lucy J. "Statistical and machine learning approaches to predicting protein–ligand interactions." Current opinion in structural biology 49 (2018): 123-128.

Data-driven computational approaches to predicting protein–ligand binding are currently achieving unprecedented levels of accuracy on held-out test datasets. Up until now, however, this has not led to corresponding breakthroughs in our ability to design novel ligands for protein targets of interest. This review summarizes the current state of the art in this field, emphasizing the recent development of deep neural networks for predicting protein–ligand binding. We explain the major technical challenges that have caused difficulty with predicting novel ligands, including the problems of sampling noise and the challenge of using benchmark datasets that are sufficiently unbiased that they allow the model to extrapolate to new regimes.

Mitchell, Laura S., and Lucy J. Colwell. "Comparative analysis of nanobody sequence and structure data." Proteins: Structure, Function, and Bioinformatics 86.7 (2018): 697-706.

Nanobodies are a class of antigen-binding protein derived from camelids that achieve comparable binding affinities and specificities to classical antibodies, despite comprising only a single 15 kDa variable domain. Their reduced size makes them an exciting target molecule with which we can explore the molecular code that underpins binding specificity—how is such high specificity achieved? Here, we use a novel dataset of 90 nonredundant, protein-binding nanobodies with antigen-bound crystal structures to address this question. To provide a baseline for comparison we construct an analogous set of classical antibodies, allowing us to probe how nanobodies achieve high specificity binding with a dramatically reduced sequence space. Our analysis reveals that nanobodies do not diversify their framework region to compensate for the loss of the VL domain. In addition to the previously reported increase in H3 loop length, we find that nanobodies create diversity by drawing their paratope regions from a significantly larger set of aligned sequence positions, and by exhibiting greater structural variation in their H1 and H2 loops.

Mitchell, Laura S., and Lucy J. Colwell. "Analysis of nanobody paratopes reveals greater diversity than classical antibodies." Protein Engineering, Design and Selection 31.7-8 (2018): 267-275.

Nanobodies (Nbs) are a class of antigen-binding protein derived from camelid immune systems, which achieve equivalent binding affinities and specificities to classical antibodies (Abs) despite comprising only a single variable domain. Here, we use a data set of 156 unique Nb:antigen complex structures to characterize Nb–antigen binding and draw comparisons with a set of 156 unique Ab:antigen structures. We analyse residue composition and interactions at the antigen interface, together with structural features of the paratopes of both data sets. Our analysis finds that the set of Nb structures displays much greater paratope diversity in terms of the structural segments involved in the paratope, the residues used at these positions to contact the antigen, and the type of contacts made with the antigen. Our findings suggest a different relationship between contact propensity and sequence variability from that observed for Ab VH domains. The distinction between sequence positions that control interaction specificity and those that form the domain scaffold is much less clear-cut for Nbs, and H3 loop positions play a much more dominant role in determining interaction specificity.