Exploring new applications of amino acid covariation analysis in modelling proteins and their complexes

Periodic Reporting for period 4 - ProCovar (Exploring new applications of amino acid covariation analysis in modelling proteins and their complexes)

Reporting period: 2021-05-01 to 2022-04-30

ProCovar aimed to investigate novel applications of amino acid covariation in proteins. In recent years, massive sequencing projects have resulted in huge numbers of protein sequences, spread across many diverse organisms. The latest sequence data banks now hold hundreds of millions of sequences, and this volume of data allows much more sophisticated evolutionary analysis to be carried out. One of the most exciting of these new approaches is to move away from so-called Markov models of protein evolution, where mutations to each residue in a protein family are modelled independently of changes occurring at other residue positions. With very large collections of sequences, deeper patterns emerge, and we are able to detect situations where residues "co-evolve", i.e. changes at one position influence the likelihood of observing changes at other positions. This covariation data has been shown to have very exciting applications in the modelling of protein structure, the prediction of gene function and the inference of interactions between proteins. Key outputs of this project have been a range of computational tools which we have made freely available as open source software, and which can be used by theoreticians and experimentalists alike. Given the large number of possible applications of covariation data, and the continual growth of sequence data banks making more and more families tractable for covariation analysis, successful completion of this whole programme of work could have significant impact across almost all areas of biomedicine, particularly as both transmembrane and disordered proteins are so hard to study by any means other than bioinformatics approaches. Overall, ProCovar has allowed us to better understand both the benefits and limitations of using amino acid covariation for protein modelling and, as a rather unexpected development, has also contributed to progress in using deep learning (AI) techniques for protein modelling problems.
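To make the idea of covariation concrete, the sketch below scores pairs of alignment columns by mutual information on a hypothetical four-sequence alignment. Mutual information is used here only as a simple stand-in for illustration; it is not the statistical model used by PSICOV or the other ProCovar tools, and the names `mutual_information` and `toy_msa` are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 0 and 2 always change together,
# while column 1 varies independently of both.
toy_msa = [
    "AKE",
    "ARE",
    "DKK",
    "DRK",
]

scores = {
    (i, j): mutual_information(toy_msa, i, j)
    for i, j in combinations(range(len(toy_msa[0])), 2)
}

for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

In this toy case the co-varying pair (0, 2) scores 1 bit and the two independent pairs score 0. Real protein families need far more sequences, plus corrections for phylogenetic bias and indirect couplings, which this sketch ignores.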

Prediction of inter-residue contacts (DeepCov and DeepMetaPSICOV)

Covariation methods aim to predict residue-residue contacts in protein structures. Methods such as our own PSICOV work well but require many diverse, homologous proteins in order to achieve good results. We developed methods that significantly extend the range of cases in which good predictions can be made, using AI (deep learning) to learn patterns in smaller protein families. We found these models could significantly outperform traditional methods, and our tool DeepMetaPSICOV (DMP) was ranked highly in the CASP13 experiment held in 2018. The software is freely available via GitHub.
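As an illustration of how contact predictions of this kind are typically consumed, the sketch below ranks scored residue pairs, keeps the top k that are sufficiently far apart in sequence, and measures precision against a set of known contacts. The scores, the known contacts, the function names and the `min_separation` default are all hypothetical; this is not the DeepMetaPSICOV interface or its evaluation protocol.

```python
def top_contacts(scores, k, min_separation=6):
    """Return the k highest-scoring pairs at least min_separation apart in sequence."""
    candidates = [
        (score, i, j)
        for (i, j), score in scores.items()
        if abs(i - j) >= min_separation
    ]
    # Highest score first; ties broken by residue indices for determinism.
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(i, j) for _, i, j in candidates[:k]]


def precision(predicted, true_contacts):
    """Fraction of predicted pairs that are real contacts."""
    if not predicted:
        return 0.0
    hits = sum(1 for pair in predicted if pair in true_contacts)
    return hits / len(predicted)


# Hypothetical predictor output: (residue i, residue j) -> contact score.
example_scores = {
    (0, 8): 0.90,
    (1, 9): 0.80,
    (2, 3): 0.95,   # sequence neighbours, filtered out below
    (0, 7): 0.40,
    (3, 10): 0.70,
}
# Hypothetical contacts taken from a known structure.
known_contacts = {(0, 8), (3, 10), (2, 3)}

top = top_contacts(example_scores, k=3)
print(top, precision(top, known_contacts))
```

Filtering out pairs that are close in sequence matters because such residues are trivially near each other in space and say little about the fold.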

Large-scale de novo structure modelling (DMPfold and DMPfold2)

The ultimate goal of predicting contacts is to predict protein structures from sequence. Building on our earlier work, we developed AI-based predictors of inter-residue distances, backbone angles and hydrogen bonds. Using these as constraints, we are able to accurately predict structures for a wide array of proteins, including membrane proteins, with excellent results. Predictions are also accompanied by accurate estimates of their likely correctness. Perhaps most crucially, the method is fast enough to be run on whole genomes, allowing us to expand the structural coverage of proteomes of importance to biological research. Again, the software is available via GitHub.
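The role of predicted distances as constraints can be sketched with a simple penalty function: given model coordinates and a list of lower and upper distance bounds, it sums the squared violations. This is only an illustrative assumption about how such constraints might be scored; the function name, the restraint format and the toy coordinates are hypothetical and do not reproduce the DMPfold pipeline.

```python
import math


def restraint_penalty(coords, restraints):
    """Sum of squared violations of (i, j, lower, upper) distance bounds.

    coords is a list of (x, y, z) tuples in angstroms; a pair whose distance
    lies inside [lower, upper] contributes nothing.
    """
    total = 0.0
    for i, j, lower, upper in restraints:
        d = math.dist(coords[i], coords[j])
        if d < lower:
            total += (lower - d) ** 2
        elif d > upper:
            total += (d - upper) ** 2
    return total


# Hypothetical four-residue trace (one point per residue).
model = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
]
# Hypothetical predicted bounds: (residue i, residue j, lower, upper).
predicted_bounds = [
    (0, 2, 5.0, 8.0),   # satisfied by the model
    (0, 3, 4.0, 6.0),   # violated: residues 0 and 3 are too far apart
]

print(restraint_penalty(model, predicted_bounds))
```

A structure-generation procedure would then adjust the coordinates to drive a penalty of this kind towards zero, alongside terms for backbone angles and hydrogen bonds.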

Protein design/Synthetic Biology

New developments in machine learning allow the generation of new examples of images, objects or other data, given large training sets of similar entities. Work done in ProCovar has shown that it is possible to use the same technology to modify existing protein sequences in order to introduce novel functionality or features, such as metal binding sites, into novel proteins. This opens the door to a wide variety of protein design tasks, and work aimed at testing these designs in the laboratory is now underway. Initially, our models did not produce the covariation signals that we were looking for, but we were eventually able to deal with this by combining machine learning language models and AlphaFold. To evaluate this, we expressed 12 synthetic genes in the lab and found that 6 of these produce soluble product, which is a relatively high rate of success. Work to further characterise these 6 proteins biophysically is currently underway, and we expect to publish these results soon.

Protein-protein interactions and modelling of protein multimers

Residue covariation signals have also been observed in protein interfaces. Ongoing work in the lab aims at predicting the presence or absence of protein-protein interactions based on residue covariation data. Work is also underway to develop methods to predict structures of homomultimeric complexes of proteins, using extensions to our successful DMPfold approach. We applied a number of tools developed within the ProCovar grant to annotate a minimal bacterial genome and produced both single-chain and multimeric structures for the complete proteome. This work allowed us to evaluate the state of the art in protein modelling for all levels of protein structure, including quaternary structure. These results have been made publicly available via a bioRxiv preprint.

Protein disorder and interactions with nucleic acids

Many parts of eukaryotic proteomes are disordered, and disorder is known to be often associated with specific biological functions, such as DNA/RNA binding, transcriptional and translational regulation, and cell cycle regulation. Flexible regions of proteins often contain clusters of covarying residues which appear to be under selection to maintain the molecule's ability to undergo specific conformational changes. To address some of these questions we developed a completely differentiable model of protein folding using GPU hardware. This implements coarse-grained molecular dynamics in which the force parameters can be learned from data, including both ordered and disordered protein regions. This work has been published and related software made available via GitHub.
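The idea of learning force parameters by differentiating through a simulation can be shown on a deliberately tiny toy problem. The sketch below simulates a single bead-bead distance relaxing under a harmonic potential, tracks the derivative of the final distance with respect to the equilibrium length `r0` through every update step, and fits `r0` by gradient descent so that the simulation ends at a target distance. Everything here (the one-dimensional system, the hand-derived sensitivities, the parameter values) is an assumption made for illustration and is not the published model, where a framework would normally compute such gradients automatically.

```python
def simulate(r0, k=1.0, x_start=6.0, step=0.1, n_steps=50):
    """Overdamped relaxation of one distance under U = 0.5 * k * (x - r0)**2.

    Returns the final distance and its derivative with respect to r0,
    obtained by differentiating each update step alongside the simulation.
    """
    x, dx_dr0 = x_start, 0.0
    for _ in range(n_steps):
        force = -k * (x - r0)               # F = -dU/dx
        dforce_dr0 = -k * (dx_dr0 - 1.0)    # chain rule applied to the force
        x += step * force
        dx_dr0 += step * dforce_dr0
    return x, dx_dr0


def fit_r0(target, r0=3.0, learning_rate=0.5, n_iters=200):
    """Gradient descent on (x_final - target)**2 with respect to r0."""
    for _ in range(n_iters):
        x_final, dx_dr0 = simulate(r0)
        grad = 2.0 * (x_final - target) * dx_dr0
        r0 -= learning_rate * grad
    return r0


# Hypothetical target: the distance observed in reference data, in angstroms.
target_distance = 3.8
learned_r0 = fit_r0(target_distance)
final_distance, _ = simulate(learned_r0)
print(learned_r0, final_distance)
```

The learned `r0` ends up slightly below the target because the short simulation has not fully relaxed from its starting point, which is exactly the kind of effect that fitting through the dynamics, rather than to a static energy minimum, takes into account.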

The latest version of our DMPfold method was published in 2022, and remains the fastest protein structure prediction method known. Although its model accuracy is not as high as that of AlphaFold, it still produces quite reasonable models and is more than 1000x faster. It can produce a single model in less than half a second on a modern GPU (Graphics Processing Unit). This huge increase in speed is very useful both for protein design work, where sequences can be very rapidly screened for compatibility with a fold, and for scanning large-scale genome databases for novel folded protein coding regions. In our paper we released over a million likely correct protein models for putative novel families found in metagenome sequence data banks. Until the most recent update of the AlphaFold database in July 2022, this remained the largest available set of predicted models using machine learning tools.

By combining language models of protein sequences and AlphaFold, we put together the first framework for designing arbitrary new protein structures using the recently released AlphaFold model. We have tried to express 12 synthetic proteins and have found that 6 produce soluble product, which is a very high level of success in the de novo protein design field. Work is continuing to further characterise these sequences and to try to understand why the other half failed.

Outline flow diagram of the ProCovar DMPfold2 method.
