
Exploring new applications of amino acid covariation analysis in modelling proteins and their complexes

Periodic Reporting for period 2 - ProCovar (Exploring new applications of amino acid covariation analysis in modelling proteins and their complexes)

Reporting period: 2018-05-01 to 2019-10-31

"The ProCovar project aims to investigate novel applications of amino acid residue covariation in proteins. Recently, massive genome sequencing projects have resulted in the availability of huge numbers of protein sequences, spread across many diverse organisms. The latest sequence data banks now have hundreds of millions of sequences, and this volume of data alllows much more sophisticated evolutionary analysis to be done. One of the most exciting of these new approaches is to move away from so called Markov models of protein evolution, where mutations to each residue in a protein family are modelled independently of changes occurring at other residue positions. With very large collections of sequences, deeper patterns emerge, and we are able to detect situations where residues ""co-evolve"" i.e. changes in one position does influence the likelihood of observing changes at other positions. This covariational data has been shown to have very exciting applications in the modelling of protein structure, the prediction of gene function and the inference of interactions between proteins. Key outputs of this project are computational tools which we make freely available as open source, and which can be used by both theoreticians and experimentalists alike. Given the large number of possible applications of covariation data, and the continual growth of sequence data banks making more and more families tractable for covariation analysis, successful completion of this whole programme of work could have significant impact across almost all areas of biomedicine, particularly as both transmembrane and disordered proteins are so hard to study by any other means than bioinformatics approaches.

Prediction of inter-residue contacts (DeepCov and DeepMetaPSICOV)

Residue covariation methods aim to predict contacting residues in protein structures. Methods such as our own PSICOV, though effective, are limited in that they require many diverse, homologous protein sequences in order to achieve satisfactory performance. We have recently developed methods that significantly extend the range of cases in which good contact predictions can be obtained. Our methods employ AI (deep convolutional neural network) models to learn patterns in residue covariation data across protein families. Once trained, these models can significantly outperform traditional methods. Our latest tool, DeepMetaPSICOV (DMP), was ranked highly in the CASP13 experiment. The software is publicly available via GitHub.

Large-scale de novo structure modelling (DMPfold)

The ultimate goal of predicting contacts is to use that information to predict whole protein structures. Building on our experience in contact prediction, we developed new AI-based predictors of inter-residue distances, backbone torsion angles and hydrogen bonds. Using these predictions as constraints to an off-the-shelf method originally used for X-ray and NMR structure determination, we are able to accurately predict structures for a vast array of proteins. Interestingly, the method can be used without modification on membrane proteins and achieves good results. Predictions are accompanied by accurate estimates of their likely correctness. Perhaps most crucially, the method is fast enough to be run on whole genomes, meaning we are able to expand the structural coverage of proteomes of importance to biological research. Again, the software is freely available via GitHub.
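
As a rough illustration of how a predicted inter-residue distance distribution might be turned into a constraint for a structure calculation program, the sketch below picks the most probable distance bin and returns its edges as lower and upper bounds. This is an assumption-laden sketch, not DMPfold's actual procedure: the function name, the bin edges and the confidence threshold are all hypothetical.

```python
def distance_bounds(bin_edges, probabilities, min_confidence=0.5):
    """Return (lower, upper) distance bounds in angstroms, or None.

    `bin_edges` has one more entry than `probabilities`; bin k spans
    bin_edges[k] to bin_edges[k + 1]. The most probable bin is used as the
    constraint, but only if its probability reaches `min_confidence`
    (a hypothetical cut-off chosen for this example).
    """
    if len(bin_edges) != len(probabilities) + 1:
        raise ValueError("expected one more bin edge than probabilities")

    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    if probabilities[best] < min_confidence:
        return None  # prediction too uncertain to use as a constraint
    return bin_edges[best], bin_edges[best + 1]


# Hypothetical prediction for one residue pair, three bins from 4 to 10 angstroms.
edges = [4.0, 6.0, 8.0, 10.0]
confident = distance_bounds(edges, [0.1, 0.7, 0.2])
uncertain = distance_bounds(edges, [0.4, 0.3, 0.3])
```

Here the confident prediction yields bounds of 6 to 8 angstroms, while the flat distribution is discarded. Filtering by confidence is one simple way to avoid feeding unreliable constraints into the structure calculation.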

Protein design/Synthetic Biology

New developments in machine learning allow the generation of new examples of images, objects or other data, given large training sets of similar entities. Recent work in my lab has shown that it is possible to use the same technology to modify existing protein sequences in order to introduce novel functionality or features, such as metal binding sites, into proteins that do not have them. This opens the door to a wide variety of protein design tasks, and work aimed at testing these designs in the laboratory is now underway. So far, the machine learning models we have been using have not produced the covariation signal we are looking for. We hope to address this limitation in the next 6 months and then test some of the designed proteins experimentally to validate the design methodology.

Protein-protein interactions and modelling of protein multimers

Residue covariation signals have also been observed in protein interfaces. Ongoing work in the lab aims at predicting the presence/absence of protein-protein interactions, based on residue covariation data. Work is also underway to develop methods to predict structures of homomultimeric complexes of proteins, using extensions to our successful DMPfold approach (described above). Eventually, we hope to extend these ideas to the prediction of heteromeric complexes as well.

Protein disorder and interactions with nucleic acids

Large fractions of eukaryotic proteomes are known or predicted to be disordered, and disorder is known to be associated with specific biological functions, such as DNA/RNA binding, transcriptional and translational regulation, and cell cycle regulation. Flexible regions of protein structures often contain clusters of covarying residues which appear to be under selection to maintain the molecule's ability to undergo specific conformational changes. Residue covariation analysis of such regions has revealed that signals corresponding to multiple, alternate conformational states can be detected. Using this information, we aim to release new tools to predict these alternative conformations, where possible in complex with DNA/RNA sequences.
DMPfold is already beyond the state of the art of one year ago. Further developments of the method should allow more accurate models to be produced from shallower sequence alignments. Many of these developments come from the use of deep learning techniques that we had not envisaged using when the project was originally proposed, and the rapid development of the AI field is allowing us to progress the tools developed in the project more rapidly than we expected. All of these tools will be released as open source and so will benefit the wider community directly and immediately. We are hoping to deploy some of the tools in the form of Web servers, so that experimentalists can benefit from these advances even without a high level of technical knowledge.

The main new developments in modelling that we expect going forward are the incorporation of domain modelling and inter-chain modelling options. These will make DMPfold applicable to large proteins comprising multiple domains and/or multiple chains. The new developments will be tested in the next CASP experiment (CASP14), scheduled to start in 2020. Tackling protein-protein modelling will require new software to handle the concatenation of alignments of separate families. So far we have one new in-house method, which we are currently benchmarking against three other approaches proposed in the literature. We expect to publish the new method along with these results early next year.
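
The alignment concatenation problem can be sketched in its simplest form as follows: sequences from two families are joined only when they come from the same organism. This is not the in-house method mentioned above, whose details are not given here. It is a naive baseline that assumes exactly one sequence per organism in each family and so ignores paralogues; the function name and example data are hypothetical.

```python
def concatenate_alignments(family_a, family_b):
    """Join aligned sequences from two families by shared organism.

    `family_a` and `family_b` map an organism label to one aligned sequence.
    Only organisms present in both families contribute a row, and each row is
    the family A sequence followed by the family B sequence.
    """
    shared = sorted(set(family_a) & set(family_b))
    return {org: family_a[org] + family_b[org] for org in shared}


# Hypothetical aligned sequences for two interacting protein families.
family_a = {"E. coli": "MKT-A", "B. subtilis": "MRTQA", "H. sapiens": "MKSQA"}
family_b = {"E. coli": "GLV", "B. subtilis": "GIV", "S. cerevisiae": "GLI"}

paired = concatenate_alignments(family_a, family_b)
```

Only the two organisms present in both families survive the pairing, which shows why paired alignments are usually much shallower than either family alignment on its own.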

Finally, we expect to carry out the planned wet-lab experiments in the next 12 months to demonstrate our ability both to design covariation signals into artificially designed proteins, and to generate sequences which can be used to supplement naturally occurring protein sequences to enhance covariation analysis results.
Outline flow diagram of the ProCovar DMPfold method.