Periodic Reporting for period 2 - MolDesign (Molecule design for next generation solar cells using machine learning approaches trained on large scale screening databases)
Reporting period: 2020-01-01 to 2020-12-31
1) The development of multiscale simulation workflows that include machine learning models to accelerate materials simulations and enable the analysis of larger and more realistic systems. Because many application-relevant materials properties depend on multiple length and time scales that cannot be covered by a single simulation method, multiscale simulation methods are required to virtually screen materials and find promising candidates for synthesis and experimental characterization.[1]
2) The development of novel molecular representations for generative models such as variational autoencoders or GANs. The conventional string-based SMILES representation is easy for humans to read, but prone to syntactic and semantic errors, which makes it hard for machine learning models to generate new SMILES strings that correctly encode molecules. In this project, we developed SELFIES, an alternative string representation of molecules that is intrinsically 100% robust, i.e. every SELFIES string corresponds to a valid molecule. This new representation significantly improves the performance of generative machine learning models, which opens up a wide range of new applications in the design of materials, molecules and drugs.[2]
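The syntactic fragility of SMILES mentioned above can be illustrated with a minimal sketch. It is not the project's code and does not implement SELFIES. The helper names `smiles_syntax_ok` and `single_deletions` are hypothetical. The check is deliberately simplified: it only tests for balanced branch parentheses and paired single-digit ring closures, and it assumes that the input contains no bracket atoms or `%nn` ring labels.

```python
def smiles_syntax_ok(smiles: str) -> bool:
    """Simplified syntax check, NOT a full SMILES parser.

    Assumption: no bracket atoms (e.g. [NH4+]) and no %nn ring labels,
    so every digit is a ring-closure label and every parenthesis a branch.
    """
    depth = 0           # current nesting depth of branches
    open_rings = set()  # ring-closure digits seen an odd number of times
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing a branch that was never opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens, second closes
    return depth == 0 and not open_rings


def single_deletions(smiles: str) -> list[str]:
    """All strings obtained by deleting exactly one character."""
    return [smiles[:i] + smiles[i + 1:] for i in range(len(smiles))]


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    variants = single_deletions(aspirin)
    broken = [v for v in variants if not smiles_syntax_ok(v)]
    print(f"{len(broken)} of {len(variants)} single-character deletions "
          "fail even this simplified syntax check")
```

Even this lenient check rejects every variant in which a parenthesis or ring-closure digit is lost, which is the kind of error a generative model can easily make when emitting SMILES character by character. According to the text above, SELFIES avoids the problem by construction, since every SELFIES string decodes to a valid molecule.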
3) Machine learning has become a widely used tool in the physical sciences, successfully applied to classification, regression and optimization tasks in many areas. Research mostly focuses on improving the accuracy of numerical predictions, while scientific understanding is still generated almost exclusively by human researchers who analyse numerical results and draw conclusions. In this work, we shift the focus to the insights and knowledge obtained by the machine learning models themselves. In particular, we study how this knowledge can be extracted and used to inspire human scientists and to improve their intuition and understanding of natural systems.[3]
During the return phase, Pascal Friederich not only worked on his MSCA project, but also established a research group at the Karlsruhe Institute of Technology, where he worked as a Tenure-Track Professor in Informatics on Artificial Intelligence for Materials Science. Within the return phase of his MSCA project, he started a collaboration with the group of Prof. Brabec at the University of Erlangen-Nürnberg on the machine-learning-driven development of materials and molecules for solar cells. Within that collaboration, he is developing machine-learning-based property prediction and decision-making workflows to select and guide automated experimental efforts in the synthesis and characterization of organic semiconductors. Furthermore, he worked on the implementation of a graph neural network library [4] to act as a basis for future research projects, and on a second collaboration project with the group of Steven Lopez at Northeastern University, which is based on the methods and results obtained in the outgoing phase of the project.[5]
Results and exploitation
During the return phase of the MSCA project, Pascal Friederich actively worked on establishing collaborations to exploit the results he obtained during the outgoing phase and to use and further extend the methods he developed at Harvard University and at the University of Toronto. Applications for third-party funding with academic and industrial collaboration partners, for scientific projects that will build on methods developed within the MolDesign project, were successful. Two international collaboration projects, both funded by the BMBF (Germany) and NRC (Canada), as well as one German project started in 2021, involving the industry partners Cynora GmbH in Germany, Nanomatch GmbH in Germany and Miru Smart Technologies in Canada.
[1] Pascal Friederich et al. 2020, MLST 2 01LT01.
[2] Mario Krenn et al. 2020, MLST 1 045024.
[3] Pascal Friederich et al. 2021, MLST 2 025027.
[4] Patrick Reiser et al. 2021, Software Impacts 100095.
[5] Jingbai Li et al. 2021, Chemical Science 12 5302-5314.