CORDIS - EU research results

DeepZyme: Learning Deep Representations of Enzymes for Predicting Catalytically-Beneficial Mutations

Periodic Reporting for period 1 - DeepZyme (DeepZyme: Learning Deep Representations of Enzymes for Predicting Catalytically-Beneficial Mutations)

Reporting period: 2021-01-01 to 2022-12-31

Over the course of evolution, nature has created and optimized extraordinary protein catalysts, called enzymes, that are fundamental in all kingdoms of life. Enzymes facilitate complex chemical reactions under physiological conditions, accelerating their rates by several orders of magnitude while remaining highly selective against alternative, undesired chemical transformations. Understanding how enzymes work and how to engineer their functions is essential for many disciplines, with applications ranging from medical therapies to biotechnological devices. The main challenge for the rational control of enzymes is that, given their complexity, it is not trivial to predict which modifications (known as mutations) are beneficial for their activity. The DeepZyme project aims to develop a model for predicting such modifications by taking advantage of revolutionary techniques in the field of deep learning. We propose to obtain condensed “representations” of enzymes by leveraging their sequence, structure, and catalytic information. These representations can be designed to describe the enzymatic information available in nature and to learn how enzymes have been tuned by selection pressures during evolution. Navigating the space of enzyme representations will allow us to finely tune enzyme properties and thereby guide a rational design process. Our model will be used together with other state-of-the-art techniques (including molecular dynamics, Markov state models, and quantum mechanics / molecular mechanics) to generate from scratch an enzyme able to catalyze chemical reactions in the synthesis of drug-like molecules.
We started by gathering and curating a comprehensive dataset of enzyme substrates, sequences, structures, and catalytic constants (kcat) from the SABIO-RK, UniProt, PubChem, and AlphaFold databases. The download and filtering of the data were automated so that the dataset can be updated easily at any time. In total, we collected ~10,000 entries to train, validate, and test a deep learning model.
Before training on this dataset, we selected auxiliary tasks with larger datasets to help identify appropriate sequence representations and network architectures. Specifically, we investigated the effect of protein embeddings on the prediction of brightness in green fluorescent proteins (GFP regression task, ~50,000 entries) and the effect of network architecture on the prediction of first-level Enzyme Commission numbers (EC classification task, ~70,000 entries). For the GFP regression task, we evaluated sequence representations obtained from pre-trained evolutionary models, including UniRep (mLSTM model), SeqVec (biLSTM model), and ProtBERT and ESM-1b (both transformer models). We found that a simple one-hot encoding of the sequence was competitive with, and more data-efficient than, all the tested representations. For the EC classification task, we tested architectures that take as input either sequence information only (convolutional neural networks, CNN) or structural information (graph neural networks, GNN). Both networks predicted EC numbers with high classification accuracy.
We then trained both the CNN and the GNN on our curated dataset to predict log10[kcat] values (regression task). Despite the limited size of our dataset and the complexity of the factors that affect kcat values, we achieved satisfactory correlations between predicted and ground-truth values, although generalization remained difficult outside the space of representative substrates, sequences, and structures.
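As an illustration of two building blocks mentioned above, the following minimal sketch shows a one-hot encoding of a protein sequence and the log10 transformation of kcat used as a regression target, together with a correlation between predicted and measured targets. It is not the project's code: the function names, the amino acid ordering, the handling of unknown residues, and the choice of the Pearson coefficient are all assumptions made for this example, since the report does not state which correlation metric was used.

```python
import math

# The 20 canonical amino acids in a fixed order (the order is an assumption of this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional row per residue; unknown residues give an all-zero row."""
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


def log10_kcat(kcat_per_second):
    """Convert a positive kcat value (in 1/s) to the log10 regression target."""
    if kcat_per_second <= 0:
        raise ValueError("kcat must be positive to take its logarithm")
    return math.log10(kcat_per_second)


def pearson_r(predicted, measured):
    """Pearson correlation between predicted and measured targets (assumed metric)."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two equally sized lists with at least two values")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_p * var_m)


if __name__ == "__main__":
    encoded = one_hot_encode("MKV")          # toy sequence, not a real enzyme
    print(len(encoded), len(encoded[0]))     # residues x alphabet size
    targets = [log10_kcat(k) for k in (0.5, 12.0, 300.0)]  # made-up kcat values
    print(pearson_r([0.0, 1.0, 2.2], targets))
```

The one-hot matrix has one row per residue and one column per amino acid, so it can be fed directly to a sequence model such as a CNN. Working with log10[kcat] rather than raw kcat compresses values that span several orders of magnitude into a range better suited to regression.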
Finally, we addressed model interpretability by adding an attention mechanism to the structural GNN model and mapping the attention weights onto known binding and catalytic residues. These analyses revealed that the network learned meaningful structural patterns, leading to deep enzymatic representations that could be used for different tasks.
As part of the project, we also aimed to design an enzyme able to catalyze an SNAr (nucleophilic aromatic substitution) reaction involved in the synthesis of a relevant drug, with the goal of reducing the environmental impact of current synthetic approaches. To that end, we computed the transition state (TS) of the reaction with the nudged elastic band method as implemented in ORCA, using density functional theory to describe the system. We then obtained thermostable structures from the Protein Data Bank (PDB) to use as scaffolds for the TS, and devised a design protocol that optimizes a multi-objective function (PyRosetta) and checks the stability of the designs using molecular dynamics (OpenMM). We tested this protocol on the scaffold of a thermostable azoreductase enzyme and found variants with improved in silico properties.
Additionally, during the project we established collaborations with experimental groups to work on the mechanistic understanding of enzymes, including a human receptor related to SARS-CoV-2 viral infection and a novel glycosyltransferase that can synthesize drug-like molecules and natural products.
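One simple way to check whether attention weights point at known binding and catalytic residues, as in the interpretability analysis described above, is to rank residues by attention and measure how many annotated residues appear among the top-ranked ones. The sketch below is only an assumed example of such a check: the report does not say how the mapping was scored, and the function names, the top-k criterion, and the toy numbers are hypothetical.

```python
def top_k_residues(attention, k):
    """Return the indices of the k residues with the largest attention weights."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return ranked[:k]


def annotated_recall(attention, known_residues, k):
    """Fraction of annotated residues found among the top-k attended residues."""
    known = set(known_residues)
    if not known:
        raise ValueError("at least one annotated residue is required")
    top = set(top_k_residues(attention, k))
    return len(top & known) / len(known)


if __name__ == "__main__":
    # Toy per-residue attention weights for a six-residue protein (made-up values).
    attention = [0.01, 0.30, 0.05, 0.25, 0.02, 0.37]
    # Hypothetical zero-based indices of annotated catalytic residues.
    catalytic = {1, 5}
    print(top_k_residues(attention, 2))
    print(annotated_recall(attention, catalytic, 2))
```

A recall well above what random residue selection would give suggests that the attention layer concentrates on functionally relevant positions. A permutation test against shuffled weights would make that comparison quantitative.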
By implementing the DeepZyme project, we made significant progress in the development of state-of-the-art computational tools for understanding and engineering enzymes. The networks we devised allow the identification of binding and catalytic residues, and yield deep representations in a latent space where catalytic constants are properly arranged. We anticipate that, in the near future, technological advances and the acquisition of more experimental data will enable the efficient design of enzymes in silico, reducing the cost of experimental testing and its environmental impact. Furthermore, the design protocol we devised allowed us to approach the generation of enzymes that catalyze an SNAr reaction mechanism, which is widely used in the production of several drugs. A successful design would represent an ideal green catalyst for the reaction, making it feasible in environmentally friendly solvents and enhancing its production efficiency. Ultimately, our contributions to the understanding of enzymes led to the identification of potent inhibitors to combat SARS-CoV-2, with direct impact on medicine and public health, and also paved the way for tailoring enzymes to synthesize industrially relevant products, which will increase the competitiveness of the European Union in the rapidly expanding biotechnology market.