Periodic Reporting for period 4 - ProCovar (Exploring new applications of amino acid covariation analysis in modelling proteins and their complexes)
Reporting period: 2021-05-01 to 2022-04-30
Covariation methods aim to predict contacts in protein structures. Methods such as our own PSICOV work well but require many diverse, homologous proteins to achieve good results. We developed methods that significantly extend the range of cases in which good predictions can be made, using AI (deep learning) to learn patterns in smaller protein families. We found that these models could significantly outperform traditional methods, and our tool DeepMetaPSICOV (DMP) was ranked highly in the CASP13 experiment held in 2018. The software is freely available via GitHub.
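As a rough illustration of what a covariation signal looks like, the sketch below scores pairs of alignment columns by mutual information with an average product correction (APC). This is a deliberately simpler technique than PSICOV or DeepMetaPSICOV and is not the project's method; the toy alignment and the function names column_mi and apc_corrected_mi are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def column_mi(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    fi = Counter(seq[i] for seq in msa)
    fj = Counter(seq[j] for seq in msa)
    fij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in fij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return mi


def apc_corrected_mi(msa):
    """Score every column pair by MI minus the average product correction."""
    length = len(msa[0])
    mi = {pair: column_mi(msa, *pair) for pair in combinations(range(length), 2)}
    # Mean MI of each column against all other columns.
    col_sum = [0.0] * length
    for (i, j), value in mi.items():
        col_sum[i] += value
        col_sum[j] += value
    col_mean = [s / (length - 1) for s in col_sum]
    overall = sum(mi.values()) / len(mi)
    if overall == 0.0:
        return dict(mi)  # no signal anywhere, nothing to correct
    return {
        (i, j): value - col_mean[i] * col_mean[j] / overall
        for (i, j), value in mi.items()
    }


# Hypothetical toy alignment: columns 0 and 2 change together (A with D,
# G with K), while columns 1 and 3 are fully conserved.
msa = ["ACDE", "ACDE", "GCKE", "GCKE", "ACDE", "GCKE"]
scores = apc_corrected_mi(msa)
top_pair = max(scores, key=scores.get)
print(top_pair)  # the covarying pair (0, 2)
```

Real families need thousands of diverse sequences before scores like these become reliable, which is the limitation the deep learning models described above were designed to ease.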
Large-scale de novo structure modelling (DMPfold and DMPfold2)
The ultimate goal of predicting contacts is to predict protein structures from sequence. Building on our earlier work, we developed AI-based predictors of inter-residue distances, backbone angles and hydrogen bonds. Using these as constraints, we are able to accurately predict structures for a wide array of proteins, including membrane proteins, with excellent results. Predictions are also accompanied by accurate estimates of their likely correctness. Perhaps most crucially, the method is fast enough to be run on whole genomes, allowing us to expand the structural coverage of proteomes of importance to biological research. Again, the software is available via GitHub.
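To make the idea of using predicted distances as constraints more concrete, here is a minimal sketch that turns a predicted distance histogram for one residue pair into a lower and upper bound plus a contact probability. The bin edges, the 8 Å contact cutoff, the example probabilities and the function name histogram_to_restraint are all assumptions for illustration; they are not taken from the DMPfold code.

```python
def histogram_to_restraint(bin_edges, probs, contact_cutoff=8.0):
    """Convert a binned distance prediction into a simple distance restraint.

    bin_edges: ascending distances in angstroms, one more than len(probs).
    probs: predicted probability mass for each bin (need not sum to 1).
    """
    if len(bin_edges) != len(probs) + 1:
        raise ValueError("need exactly one more bin edge than probabilities")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have positive total mass")
    norm = [p / total for p in probs]

    # Use the most probable bin as the restraint interval.
    best = max(range(len(norm)), key=norm.__getitem__)

    # Contact probability: mass of all bins lying entirely below the cutoff.
    contact_prob = sum(
        p for p, upper in zip(norm, bin_edges[1:]) if upper <= contact_cutoff
    )
    return {
        "lower": bin_edges[best],
        "upper": bin_edges[best + 1],
        "confidence": norm[best],
        "contact_prob": contact_prob,
    }


# Hypothetical prediction for one residue pair.
restraint = histogram_to_restraint(
    bin_edges=[4.0, 6.0, 8.0, 10.0, 12.0],
    probs=[0.1, 0.6, 0.2, 0.1],
)
print(restraint["lower"], restraint["upper"])  # 6.0 8.0
```

A structure-building step would then search for coordinates that satisfy many such intervals at once, weighting each by its confidence.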
Protein design/Synthetic Biology
New developments in machine learning allow the generation of new examples of images, objects or other data, given large training sets of similar entities. Work done in ProCovar has shown that it is possible to use the same technology to modify existing protein sequences in order to introduce novel functionality or features, such as metal binding sites, into novel proteins. This opens the door to a wide variety of protein design tasks, and work aimed at testing these designs in the laboratory is now underway. Initially, our models did not produce the covariation signals we were looking for, but we eventually resolved this by combining machine learning language models with AlphaFold. To evaluate this, we expressed 12 synthetic genes in the lab and found that 6 of these produce soluble product, which is a relatively high rate of success. Work to further characterise these 6 proteins biophysically is currently underway, and we expect to publish these results soon.
Protein-protein interactions and modelling of protein multimers
Residue covariation signals have also been observed in protein interfaces. Ongoing work in the lab aims to predict the presence or absence of protein-protein interactions based on residue covariation data. Work is also underway to develop methods to predict structures of homomultimeric protein complexes, using extensions to our successful DMPfold approach. We applied a number of tools developed within the ProCovar grant to annotate a minimal bacterial genome and produced both single-chain and multimeric structures for the complete proteome. This work allowed us to evaluate the state of the art in protein modelling for all levels of protein structure, including quaternary structure. These results have been made publicly available via a bioRxiv preprint.
Protein disorder and interactions with nucleic acids
Many parts of eukaryotic proteomes are disordered, and disorder is known to be often associated with specific biological functions, such as DNA/RNA binding, transcriptional and translational regulation, and cell cycle regulation. Flexible regions of proteins often contain clusters of covarying residues which appear to be under selection to maintain the molecule's ability to undergo specific conformational changes. To address some of these questions we developed a completely differentiable model of protein folding using GPU hardware. This implements coarse-grained molecular dynamics in which the force parameters can be learned from data, including both ordered and disordered protein regions. This work has been published and related software made available via GitHub.
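The core idea of learning a force parameter by differentiating through a simulation can be shown with a toy sketch. The example below is an assumption-laden simplification and not the published model: it uses a single bead in one dimension, overdamped relaxation instead of full molecular dynamics, and a hand-propagated derivative instead of an automatic differentiation library. The names simulate and fit_r0 and all numerical values are hypothetical.

```python
def simulate(r0, x0=0.0, k=1.0, dt=0.1, steps=50):
    """Relax one bead in a harmonic well centred at r0.

    Returns the final position and its derivative with respect to r0,
    propagated through every integration step (forward-mode sensitivity).
    """
    x, dx_dr0 = x0, 0.0
    for _ in range(steps):
        force = -k * (x - r0)
        dforce_dr0 = -k * (dx_dr0 - 1.0)  # chain rule through the force
        x += dt * force
        dx_dr0 += dt * dforce_dr0
    return x, dx_dr0


def fit_r0(target, r0=1.0, lr=0.1, iters=200):
    """Gradient descent on r0 so the simulated bead ends at the target."""
    for _ in range(iters):
        x, dx_dr0 = simulate(r0)
        grad = 2.0 * (x - target) * dx_dr0  # d/dr0 of (x - target)^2
        r0 -= lr * grad
    return r0


# Learn the well position that makes the simulation end at x = 3.8.
learned_r0 = fit_r0(target=3.8)
final_x, _ = simulate(learned_r0)
print(round(final_x, 6))
```

In a differentiable simulator running on GPU hardware, the same pattern applies to many force parameters at once, with the derivatives obtained automatically and the loss measured against known structures.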
By combining language models of protein sequences and AlphaFold, we put together the first framework for designing arbitrary new protein structures using the recently released AlphaFold model. We have tried to express 12 synthetic proteins and have found 6 produce soluble product, which is a very high level of success in the de novo protein design field. Work is continuing to further characterise these sequences and to try to understand why the other half failed.