CORDIS - EU research results

Molecule design for next generation solar cells using machine learning approaches trained on large scale screening databases

Periodic Reporting for period 2 - MolDesign (Molecule design for next generation solar cells using machine learning approaches trained on large scale screening databases)

Reporting period: 2020-01-01 to 2020-12-31

The development of new materials and molecules is of crucial importance for many global challenges, including new technologies for green energy and new chemicals to be used as drugs. Over the last decade, artificial intelligence and machine learning have led to major breakthroughs in fields such as natural language processing and computer vision. In the natural sciences, in particular in materials science and chemistry, similar breakthroughs are possible but require the development of task-specific machine learning methods. This project addresses multiple aspects of this challenging task. Firstly, the project aims to develop and apply novel representations of materials and molecules that can be used for discriminative and generative tasks, i.e. the virtual prediction of materials properties based on existing data, and the automated, computer-aided design of novel materials to be tested using simulations or experiments. Secondly, the project ultimately aims to generate new scientific understanding from data-driven approaches such as machine learning, which are intrinsically numerical procedures and therefore hard to interpret. Thirdly, the project aims to apply these machine learning methods to materials and technologies that are relevant for global challenges, for example energy applications such as organic solar cells.
During the outgoing phase of the project, Pascal Friederich worked in the group of Alan Aspuru-Guzik, first at Harvard University and then at the University of Toronto, where the group moved in July 2019. The work focused on various aspects of machine learning methods applied to chemistry and materials science. Some of the main results include:
1) The development of multiscale simulation workflows that include machine learning models to accelerate materials simulations and enable the analysis of larger and more realistic systems. Because many application-relevant materials properties depend on a range of length and time scales that no single simulation method can cover, multiscale simulation methods are required to virtually screen materials and find promising candidates for synthesis and experimental characterization.[1]
2) The development of novel representations for generative models such as variational autoencoders or GANs. The conventional, string-based SMILES representation is easily readable for humans, but prone to semantic and syntactic errors. This makes it hard for machine learning models to generate new SMILES strings that correctly encode molecules. In this project, we developed SELFIES, an alternative string representation of molecules that is intrinsically 100% robust, i.e. every SELFIES string corresponds to a valid molecule. This new representation shows significantly improved performance in generative machine learning models, which opens up a wide range of new applications in the design of materials, molecules and drugs.[2]
3) Machine learning has become a widely used tool in the physical sciences, successfully applied to classification, regression and optimization tasks in many areas. Research mostly focuses on improving the accuracy of numerical predictions, while scientific understanding is still generated almost exclusively by human researchers who analyse numerical results and draw conclusions. In this work, we shift the focus to the insights and knowledge obtained by the machine learning models themselves. In particular, we study how this knowledge can be extracted and used to inspire human scientists and to strengthen their intuition and understanding of natural systems.[3]
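The syntactic fragility of SMILES mentioned in result 2 can be illustrated with a small sketch. The function below is a hypothetical toy check written for this illustration, not part of SELFIES or of any cheminformatics library, and it is far from a full SMILES parser: it only tests whether branch parentheses are balanced and whether single-digit ring-closure labels occur in pairs. It assumes that digits inside square brackets are isotope, charge or hydrogen counts and ignores them, and it does not handle two-digit ring closures written with `%`.

```python
def smiles_syntax_ok(smiles: str) -> bool:
    """Toy syntax check (illustrative assumption, not a real SMILES parser).

    Returns True if branch parentheses are balanced, bracket atoms are
    closed, and every single-digit ring-closure label appears in pairs.
    """
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure labels that are opened but not closed
    in_bracket = False   # True while inside a bracket atom such as [13CH4]

    for ch in smiles:
        if in_bracket:
            # Digits inside brackets are not ring closures; skip them.
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing a branch that was never opened
        elif ch.isdigit():
            # First occurrence opens a ring, second occurrence closes it.
            if ch in open_rings:
                open_rings.remove(ch)
            else:
                open_rings.add(ch)

    return depth == 0 and not open_rings and not in_bracket


if __name__ == "__main__":
    # Benzene and acetic acid, followed by the same strings with one
    # character removed, as a generative model might emit them.
    for s in ["c1ccccc1", "CC(=O)O", "c1ccccc", "CC(=O"]:
        print(s, smiles_syntax_ok(s))
```

Removing a single character from a valid string is enough to leave a ring or a branch unclosed, which is the kind of error a generative model must learn to avoid when it emits SMILES character by character. A representation in which every string decodes to a valid molecule, as the report states for SELFIES, avoids this class of failure by construction.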

During the return phase, Pascal Friederich not only worked on his MSCA project, but also established a research group at the Karlsruhe Institute of Technology and worked as a Tenure-Track Professor in Informatics on Artificial Intelligence for Materials Science. Within the return phase of his MSCA project, Pascal Friederich started a collaboration with the group of Prof. Brabec at the University of Erlangen-Nürnberg on the machine-learning-driven development of materials and molecules for solar cells. Within that collaboration, Pascal Friederich is developing machine-learning-based property prediction and decision-making workflows to select and guide automated experimental efforts in the synthesis and characterization of organic semiconductors. Furthermore, Pascal Friederich worked on the implementation of a graph neural network library [4] to act as a basis for future research projects, and on a second collaboration project with the group of Steven Lopez at Northeastern University, which is based on the methods and results obtained in the outgoing phase of the project.[5]

Results and exploitation
During the return phase of the MSCA project, Pascal Friederich actively worked on establishing collaborations to exploit the results he obtained during the outgoing phase and to use and further extend the methods he developed at Harvard University and at the University of Toronto. Applications for third-party funding with academic and industrial collaboration partners, for scientific projects that build on methods developed within the MolDesign project, were successful. Two international collaboration projects, both funded by the BMBF (Germany) and NRC (Canada), as well as one German project started in 2021, involving the industry partners Cynora GmbH in Germany, Nanomatch GmbH in Germany and Miru Smart Technologies in Canada.

[1] Pascal Friederich et al. 2020, MLST 2 01LT01.
[2] Mario Krenn et al. 2020, MLST 1 045024.
[3] Pascal Friederich et al. 2021, MLST 2 025027.
[4] Patrick Reiser et al. 2021, Software Impacts 100095.
[5] Jingbai Li et al. 2021, Chemical Science 12 5302-5314.
The project led to developments beyond the state of the art in multiple aspects, including the aforementioned areas, namely machine learning methods integrated in multiscale materials modeling approaches, molecule representations for machine learning methods, and interpretable machine learning models. During the MSCA project, in particular during the return phase, Pascal Friederich and his collaboration partners brought those newly developed tools closer to applications. They applied machine-learning-enhanced multiscale simulations to a variety of materials that are relevant for organic electronics, including organic solar cells. SELFIES is currently being widely adopted by the community and we expect to see promising results, in particular for drug design. Due to the closed nature of industrial research, the exact impact will be hard to quantify, but direct feedback from collaboration partners in industry indicates a high potential for success. The development of interpretable machine learning methods is not only of interest for the materials science community, but sparks interest in almost any application domain of machine learning, including applications that are much closer to everyday life than materials science. While promising progress on interpretability can be observed in the field of computer vision, this project focused on graph neural networks, which are of growing interest not only for social graphs and recommendation systems, but also for materials science and chemistry. This work laid a foundation for the future scientific success of Pascal Friederich and his newly established AiMat research group (https://aimat.science) at the Karlsruhe Institute of Technology.
Artificial intelligence and machine learning for molecules, materials and applications