
Decoding, Mapping and Designing the Structural Complexity of Hydrogen-Bond Networks: from Water to Proteins to Polymers

Periodic Reporting for period 4 - HBMAP (Decoding, Mapping and Designing the Structural Complexity of Hydrogen-Bond Networks: from Water to Proteins to Polymers)

Reporting period: 2020-11-01 to 2021-04-30

Several of the compounds that are most essential for life, and that underlie crucial societal challenges from health to energy, are held together by a stable yet labile chemical bond that involves two electronegative atoms and one hydrogen atom - the so-called hydrogen bond. To emphasize the versatile nature of the hydrogen bond and its ubiquity, it suffices to say that water, DNA, proteins, several polymers such as Kevlar, as well as most small organic molecules that are used as drugs, have a structure that is largely determined by hydrogen bonds.

One of the main reasons behind this versatility is that hydrogen bonds rarely come alone: they often give rise to cooperative networks in which the whole is much more than the sum of the parts. Understanding the complexity that arises when thousands of these relatively simple chemical units combine to form a protein or an extended crystal is an enormous challenge that limits our ability to tune the behavior and performance of all of these materials.

Computer simulations can help significantly to elucidate the structure-property relations of H-bonded materials, by giving direct access to the behavior of individual atoms on a length scale of a billionth of a meter, and on a time scale of less than a billionth of a second. To develop their full potential, however, simulations must achieve greater predictive accuracy, for example by including a full treatment of the quantum mechanical nature of both electrons and light nuclei (such as hydrogen itself). Furthermore, there is a great need for techniques borrowed from research in artificial intelligence to sift through the enormous amount of data generated by large-scale simulations. The objectives of HBMAP revolve around the use of machine-learning techniques to gain a better understanding of hydrogen-bonded materials, from water to drug molecules, and thereby to clarify their structure-property relations and help design more effective drugs and more resistant, lightweight or biodegradable materials.
HBMAP is based on a fundamental, rigorous approach to the application of data-driven techniques to atomistic simulations. This has led to the development of a systematic formulation of the problem of mapping atomic structures into the inputs of machine-learning models, and of the interplay between unsupervised (pattern recognition, visualization) and supervised (direct property prediction) techniques, which facilitates translating simulation data into an intuitive understanding of structure-property relations.

Several seminal papers have been published as a result of this project, discussing novel techniques including (1) a probabilistic analysis of molecular motifs (PAMM) scheme that is proving invaluable to identify recurring patterns in an atomistic simulation, such as hydrogen-bonding modes in water, protein folds and misfolds, and packing of molecules in a crystal; (2) symmetry-adapted Gaussian process regression (SA-GPR) that makes it possible to predict functional properties of materials that have a tensorial nature, such as those that determine the optical spectra of hydrogen-bonded systems; (3) the ATLAS technique to explore high-dimensional free-energy surfaces; (4) kernel principal covariate regression (KPCovR), a technique combining supervised and unsupervised learning to visualize and understand structure-property relations; (5) the long-distance equivariant framework (LODE) to incorporate long-range interactions in atomistic machine learning.
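To make the idea behind item (4) more concrete, the following is a minimal sketch of principal covariate regression in its linear, sample-space form. It is not the project's implementation and not the kernel variant itself: a kernel version would replace the Gram matrix X Xᵀ with a kernel matrix and the linear fit with a kernel regression. The function name `pcovr_projection`, the mixing parameter `alpha`, the small `ridge` regularizer, and the normalization of X and Y are all assumptions made for illustration.

```python
import numpy as np


def pcovr_projection(X, Y, alpha=0.5, n_components=2, ridge=1e-8):
    """Illustrative linear principal covariate regression (hypothetical helper).

    alpha = 1 gives PCA-like scores (purely unsupervised);
    alpha = 0 gives a projection driven only by the regression of Y on X.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float).reshape(len(X), -1)

    # Center and scale both blocks so that their contributions are comparable.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)

    # Supervised part: (lightly regularized) linear prediction of Y from X.
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
    Y_hat = X @ W

    # Mix the structural Gram matrix with the Gram matrix of the predictions.
    K = alpha * (X @ X.T) + (1.0 - alpha) * (Y_hat @ Y_hat.T)

    # Leading eigenvectors of the mixed matrix define the latent projection.
    evals, evecs = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:n_components]
    evals = np.clip(evals[order], 0.0, None)
    return evecs[:, order] * np.sqrt(evals)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 5))          # toy "structural features"
    Y = X @ rng.normal(size=(5, 1))       # toy "property", linear in X
    T = pcovr_projection(X, Y, alpha=0.5)
    print(T.shape)
```

Intermediate values of `alpha` yield a low-dimensional map that preserves structural variance while keeping the target property well correlated with the map coordinates, which is the sense in which such a method combines unsupervised and supervised learning.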

These methodological advances have helped shed light on the behavior of different classes of hydrogen-bonded materials. For aqueous systems, HBMAP provided improved simulation strategies based on machine learning, as well as an automated analysis of the hydrogen-bond network and of the synthesizability of ice structures. For molecular materials, we could predict properties and represent the structure-energy-property landscapes with unprecedented clarity. For (bio)polymers, we could provide an automatic recognition of structural patterns in proteins and self-assembled systems. In all these cases, HBMAP has delivered on its promise of using data-driven approaches to rationalize the emergent behavior of hydrogen-bonded materials.

The methodological work in HBMAP has also resulted in a wealth of open-source and open-science contributions. These include librascal (a library to build representations of atomic structures for ML), i-PI (a modular tool for advanced molecular simulations), and scikit-cosmo (a collection of utilities to apply automated data analysis to molecular simulations). In many cases, demonstrations of the tools we developed are available as on-line applications (http://shiftml.org, http://alphaml.org, http://chemiscope.org), further lowering the barrier to translating HBMAP developments into publicly usable instruments of discovery.
HBMAP took place right at the time when applications of machine learning to atomic-scale simulations evolved from a niche area of research into a mainstream modeling technique, and it has played a substantial role in establishing the gold standard for these developments.
Several of these developments have been pioneered in this project: the rigorous characterization of the mathematical mapping of structures into features, which is the prerequisite for the application of automated data analysis; the combination of supervised and unsupervised techniques; the careful assessment of the uncertainty of ML predictions; and the combination of data-driven techniques with the traditional tools of molecular dynamics and statistical mechanics.

Analyses of simulations of materials, particularly those as complex as proteins or molecular crystals, often rely on heuristic rules or empirical principles to rationalize structure-property relations. This project has made it possible to interpret simulations and experiments using less biased approaches that rely on the analysis of correlations present in the data, rather than on preconceived notions based, for example, on prior knowledge of similar systems. For instance, this has allowed us to objectively assess the extent of hydrogen bonding in water, and to recognize the link between packing motifs in molecular crystals and their electronic properties. A better understanding of the interplay between data-analytical techniques and materials modeling has allowed us to draw an atlas of the known solid phases of water, and to propose more than 20 new candidates that show substantial promise for being synthesized.
The self-assembly of biomimetic polymers and the structure of coated nanoparticles demonstrate how the HBMAP project succeeded in targeting the emergent behavior that arises from the cooperative nature of the H-bond network.

While the focus of HBMAP has been, in line with the proposal, on hydrogen-bonded materials, we have also shown how the techniques we developed can be applied to other classes of complex materials, including for example porous silicate frameworks (zeolites), which are extremely important as catalysts and separation media. The foundations laid by the successful completion of the HBMAP project have supported, and will continue to support, several synergistic and follow-up projects, often in collaboration with industrial partners, that provide a direct path to the transfer of knowledge to more applied, and technologically urgent, domains.
Machine learning separates molecules that are active binders to a given protein from those that are inactive
NMR chemical shieldings for a complex molecular material using the ShifTML machine learning model
A probabilistic analysis of molecular motifs to recognize recurring patterns in (bio)molecules