Periodic Reporting for period 1 - MARMOTTE (Tessellation-based analysis of dynamic protein structures and their complexes - MoleculAR MOTions meet TEssellations (MARMOTTE).)
Reporting period: 2023-09-01 to 2025-08-31
The MARMOTTE project aimed to improve the computational analysis of protein structures and their complexes by exploring and predicting the structural heterogeneity of interatomic contacts. The project focused on a Voronoi tessellation-based approach to interaction analysis, which has been shown to be more descriptive than traditional pairwise distance-based methods. Specifically, the project had three scientific objectives: to develop methods that (1) efficiently compute tessellation-derived contact areas in dynamic structures represented as sets of conformational snapshots, (2) predict how these contact areas change upon motion, and (3) use the predicted statistical properties of contact areas to estimate protein-protein binding energy scores suitable for ranking and selecting complex models.
The researcher developed an automated workflow for collecting and quantifying contact area heterogeneity from sequence-clustered ensembles in the Protein Data Bank (PDB). From these data, area persistence values were derived for each unique contact, defining the ground truth for classifying contacts as stable or unstable. The data were divided into training, validation, and test sets. A Voronota-LT-based subarea calculation algorithm was created to divide atom-atom contact areas into layers (by distance from the solvent boundary) and sectors (by atomic directions), generating fine-grained contact descriptors for machine learning. Using the training data, the researcher estimated contact-type probabilities of occurrence and of persistence and found that the two are not highly correlated, so combining them could benefit protein structure assessment tasks. The researcher derived heterogeneity-informed statistical pseudo-energy coefficients and used them to compute pseudo-energy values that serve as classifier input features.
The researcher introduced the Voronoi Contacts Block (VCBlock) descriptor, which summarizes an inter-residue contact and its neighbors in a permutation-invariant vector form and thereby enables neural network training on contact-level properties. A VCBlock-based neural network classifier was trained to predict whether a contact area in a protein structure is stable or unstable within an ensemble. Tested on unseen PDB data, it achieved an accuracy of 0.78. A standalone software tool, VoroMarmotte, was developed to apply this classifier.
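The idea of deriving a persistence value per contact from an ensemble can be illustrated with a minimal sketch. The report does not give the exact definition used in MARMOTTE, so the code below assumes, purely for illustration, that persistence is the fraction of ensemble members in which a contact has non-zero area, and that a hypothetical threshold of 0.5 separates stable from unstable contacts. The function names, contact identifiers, and area values are all made up.

```python
def contact_persistence(ensemble, min_area=0.0):
    """Fraction of snapshots in which each contact has area > min_area.

    ensemble: list of dicts, one per snapshot, mapping a contact id to its area.
    ASSUMPTION: this fraction-based definition is illustrative only; the
    project's actual persistence measure is not specified in the report.
    """
    counts = {}
    for snapshot in ensemble:
        for contact, area in snapshot.items():
            present = 1 if area > min_area else 0
            counts[contact] = counts.get(contact, 0) + present
    n_snapshots = len(ensemble)
    # A contact missing from a snapshot counts as absent in that snapshot.
    return {contact: k / n_snapshots for contact, k in counts.items()}


def label_contacts(persistence, threshold=0.5):
    """Label each contact as stable or unstable (threshold is hypothetical)."""
    return {
        contact: "stable" if value >= threshold else "unstable"
        for contact, value in persistence.items()
    }


# Toy ensemble of four snapshots with two residue-residue contacts.
ensemble = [
    {("A:10", "B:55"): 12.4, ("A:11", "B:57"): 3.1},
    {("A:10", "B:55"): 10.9},
    {("A:10", "B:55"): 11.7, ("A:11", "B:57"): 0.0},
    {("A:10", "B:55"): 13.0},
]

persistence = contact_persistence(ensemble)
labels = label_contacts(persistence)
for contact in sorted(persistence):
    print(contact, persistence[contact], labels[contact])
```

Labels produced this way could serve as classification targets of the kind described above, although the real workflow operates on tessellation-derived areas computed from PDB ensembles.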
The researcher demonstrated that the VoroMarmotte method for predicting contact stability can be used to assess protein-protein complex predictions by aggregating contact-level outputs into global interface scores. Native and high-quality models consistently showed higher predicted persistent interface areas. Further work defined a contact area persistence-based pseudo-energy score and applied it to the data from the EGFR Protein Design Competition, showing that it can be instrumental in distinguishing binders from non-binders among designed proteins. A computational binder optimization pipeline was also built to propose mutations improving interface stability based on the new pseudo-energy scoring.
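One simple way to aggregate contact-level predictions into a global interface score is sketched below. It assumes that each interface contact comes with an area and a predicted probability of being stable, and it sums the area weighted by that probability to obtain an expected persistent interface area. This aggregation rule, the function name, and the numbers are illustrative assumptions, not the documented VoroMarmotte procedure.

```python
def persistent_interface_area(contacts):
    """Aggregate per-contact predictions into interface-level values.

    contacts: list of (area, p_stable) pairs, where p_stable is a predicted
    probability in [0, 1] that the contact is stable.
    ASSUMPTION: area-weighted summation is one plausible aggregation; the
    actual scoring used in the project may differ.
    Returns (total_area, expected_persistent_area).
    """
    total_area = sum(area for area, _ in contacts)
    expected_persistent_area = sum(area * p for area, p in contacts)
    return total_area, expected_persistent_area


# Two hypothetical models of the same complex, with made-up predictions.
model_a = [(12.0, 0.9), (8.0, 0.5), (5.0, 0.2)]
model_b = [(12.0, 0.3), (8.0, 0.4), (5.0, 0.2)]

total_a, persistent_a = persistent_interface_area(model_a)
total_b, persistent_b = persistent_interface_area(model_b)

# Both models bury the same total area, but model_a is predicted to have
# more persistent interface area and would therefore be ranked higher.
ranked = sorted(
    [("model_a", persistent_a), ("model_b", persistent_b)],
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

Under this sketch, two models with equal interface size can still be told apart by how much of that interface is predicted to persist.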
Additionally, the researcher developed a contact area-based statistical potential method, VoroChipmunk, that directly uses observed contact area occurrence and persistence probabilities to score protein-protein interface predictions. The researcher also developed VoroIF-GNN-v2, a graph attention neural network for predicting interface quality at the level of residue-residue contacts. It operates on tessellation-derived protein-protein interface graphs annotated with VoroChipmunk-like descriptors. During CASP16-CAPRI in 2024, the researcher's scoring group "Olechnovic" employed VoroIF-GNN-v2 and was ranked first in the CAPRI scoring category.
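As a rough illustration of how a contact area-based statistical potential can work, the sketch below derives per-contact-type pseudo-energy coefficients as negative log-ratios of observed to expected area fractions and scores an interface as an area-weighted sum of those coefficients. This is a generic knowledge-based potential under stated assumptions, not the VoroChipmunk formulation, which according to the report also uses persistence probabilities. The contact types, reference values, and function names are hypothetical.

```python
import math


def pseudo_energy_coefficients(observed_area, expected_area):
    """Per-type coefficients: -ln(observed fraction / expected fraction).

    observed_area, expected_area: dicts mapping a contact type to the summed
    area seen in reference data and under a reference (null) model.
    ASSUMPTION: a plain inverse-Boltzmann-style log-ratio is used here only
    to illustrate the principle.
    """
    observed_total = sum(observed_area.values())
    expected_total = sum(expected_area.values())
    coefficients = {}
    for contact_type, area in observed_area.items():
        observed_fraction = area / observed_total
        expected_fraction = expected_area[contact_type] / expected_total
        coefficients[contact_type] = -math.log(observed_fraction / expected_fraction)
    return coefficients


def interface_score(contacts, coefficients):
    """Area-weighted sum of coefficients; lower means more favorable."""
    return sum(area * coefficients[contact_type] for contact_type, area in contacts)


# Made-up reference statistics for three coarse contact types.
observed = {"hydrophobic-hydrophobic": 600.0, "polar-polar": 300.0, "hydrophobic-polar": 100.0}
expected = {"hydrophobic-hydrophobic": 400.0, "polar-polar": 300.0, "hydrophobic-polar": 300.0}

coefficients = pseudo_energy_coefficients(observed, expected)

# A hypothetical interface described as (contact type, area) pairs.
interface = [("hydrophobic-hydrophobic", 20.0), ("hydrophobic-polar", 5.0)]
score = interface_score(interface, coefficients)
print(coefficients, score)
```

Contact types that are over-represented relative to the reference receive negative coefficients, so interfaces rich in such contacts obtain lower (more favorable) scores.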
The researcher implemented all the developed methods as open-source software and made them openly and freely available under a permissive license. The core software, Voronota-LT, was made accessible in multiple ways: as libraries in different programming languages, a command-line application, and a web application. Also, thanks to established collaborations, the researcher implemented a Voronota-LT plugin that was included in the Faunus framework for Monte Carlo molecular simulations.
The researcher initiated steps to increase the uptake and success of the developed methods: exploring use cases in the field of protein design, preparing method-related publications, and integrating the developed software into established protein structure analysis and modeling frameworks.