A large-scale, standardized dataset that is representative of the problem of interest is a precondition for supervised machine learning projects. Until now, this requirement had not been met for protein structure determination with NMR spectroscopy. Removing this obstacle was therefore proposed as the first major objective of this Marie Skłodowska-Curie action. We established an annotated dataset of NMR measurements comprising 1329 spectra, from which 100 protein structures can be reproduced using the original data.
In the second part of the project, we focused on automated visual analysis of the spectra. First, we used over 600 000 manually annotated cross-peak examples to train a deep residual neural network (ResNet) that automatically detects true signals in a recorded spectrum and distinguishes them from impurities and artifacts. Next, we implemented a generator of synthetic NMR spectrum fragments, which was used to train a second instance of the ResNet model for deconvolution of highly overlapping signals. Finally, chemical shifts deposited in the BMRB database were used to build a kernel density estimator of cross-peak positions, which makes it possible to unfold recorded signals. Used sequentially, the three models automatically extract signal frequencies from NMR spectra.
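The unfolding step can be illustrated with a small sketch. It assumes a simplified one-dimensional case in which a folded peak appears shifted by an integer multiple of the spectral width, and it picks the candidate position that is most probable under a Gaussian kernel density estimate built from reference shifts. The function names, the bandwidth, and the toy reference values are hypothetical and chosen for illustration only; this is not the model used in the project.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function evaluating a 1D Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def unfold(observed_ppm, spectral_width_ppm, density, max_folds=2):
    """Choose the aliased candidate position with the highest density.

    Candidates are observed_ppm + k * spectral_width_ppm for integer k
    in [-max_folds, max_folds] (simplified folding model, an assumption).
    """
    candidates = [
        observed_ppm + k * spectral_width_ppm
        for k in range(-max_folds, max_folds + 1)
    ]
    return max(candidates, key=density)


# Toy reference shifts in ppm (hypothetical stand-ins for database statistics).
reference_shifts = [118.2, 119.5, 120.1, 121.7, 122.4, 123.0, 124.8]
density = gaussian_kde(reference_shifts, bandwidth=2.0)

# A peak observed at 90.0 ppm in a spectrum with a 30.0 ppm spectral width.
unfolded = unfold(90.0, 30.0, density)
print(unfolded)
```

In this toy case the candidates are 30, 60, 90, 120 and 150 ppm, and the candidate at 120 ppm lies closest to the reference distribution, so it is selected.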
In the third part of the project, we integrated our approach to visual spectrum analysis with methods implemented in CYANA. One of the key steps in this process was the development of a graph neural network that captures dependencies between chemical shift values and supports the interpretation of ambiguous signals. As a result, we obtained an end-to-end approach (ARTINA) that fulfills the main aim of the DeepNOE project by automating protein structure determination with NMR spectroscopy. The method runs without any human intervention and takes only the protein sequence and a set of NMR spectra as input. It also returns the intermediate spectrum annotations, so that a researcher can manually verify the automatically determined structure.
In the quantitative evaluation, ARTINA automatically determined 100 protein structures, which were compared with the corresponding PDB depositions and showed high agreement, with a median backbone RMSD of 1.44 Å to the reference. In this experiment, ARTINA correctly assigned 90.39% of the chemical shifts when compared with BMRB depositions. The method assigned backbone chemical shifts more accurately (96.03% mean accuracy) than side-chain chemical shifts (86.50%), mainly because of difficulties in aromatic ring assignments (76.87%).
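To make the assignment accuracy figures concrete, the sketch below shows one plausible way to compute such a score: a predicted chemical shift counts as correct if it lies within a tolerance of the reference value for the same atom. The tolerance value, the data layout, and the function name are assumptions made for illustration; the text above does not specify the exact criterion used in the evaluation.

```python
def assignment_accuracy(predicted, reference, tolerance_ppm):
    """Fraction of reference shifts matched by a prediction within tolerance.

    Both arguments map (residue_number, atom_name) to a shift in ppm.
    Atoms missing from `predicted` count as incorrect.
    """
    if not reference:
        raise ValueError("reference must contain at least one shift")
    correct = 0
    for atom, ref_shift in reference.items():
        pred_shift = predicted.get(atom)
        if pred_shift is not None and abs(pred_shift - ref_shift) <= tolerance_ppm:
            correct += 1
    return correct / len(reference)


# Toy data (hypothetical values, not taken from the evaluation).
reference = {(5, "H"): 8.21, (5, "HA"): 4.35, (6, "H"): 7.95, (6, "HA"): 4.60}
predicted = {(5, "H"): 8.22, (5, "HA"): 4.36, (6, "H"): 8.40}

# Assumed tolerance of 0.03 ppm for proton shifts.
accuracy = assignment_accuracy(predicted, reference, tolerance_ppm=0.03)
print(f"{accuracy:.2%}")
```

Here two of the four reference shifts are matched: one prediction is off by 0.45 ppm and one atom has no prediction, so both count as errors.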
ARTINA is publicly available online as a service (SaaS) within the NMRtist system (https://nmrtist.org). The method has been presented at the EUROMAR conference (2021) and in an Emerging MR Webinar. During practical sessions of the “Biomolecular NMR: Advanced Tools” workshop at Gothenburg University and a graduate course at Goethe University in Frankfurt, over 50 participants tested ARTINA in practice by automatically solving protein structures and assigning chemical shifts.