Community Research and Development Information Service - CORDIS


BIGCHEM Report Summary

Project ID: 676434
Funded under: H2020-EU.1.3.1.

Periodic Reporting for period 1 - BIGCHEM (Big Data in Chemistry)

Reporting period: 2016-01-01 to 2017-12-31

Summary of the context and overall objectives of the project

BIGCHEM is a Marie Skłodowska-Curie (MC) European Industrial Doctorate (EID) Innovative Training Network (ITN) at the interface of chemistry, computer science and life sciences. The overall goal of the project is to integrate research and teaching activities across the academic and industrial institutions and to provide a “universal” education in Big Data analysis in chemistry. Since the start of the project, results have been presented at 15 conferences and in twelve articles (see

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

The work is structured in ten research projects that are interrelated:
ESR1: A variety of machine learning methods have been investigated for activity prediction using large compound data sets, which resulted in practical guidelines for training models and predicting active compounds. We also showed that different machine learning methods can rely on dissimilar sets of features, which are responsible for their performance. A methodology for selecting these features, together with a way to interpret the calculations, was presented.
ESR2: The fellow studied the influence of different automated strategies for training set composition on model quality, and the identification of subsets that require closer or manual inspection. Our preliminary results indicate that modern machine learning methods (e.g. random forests) are relatively robust with respect to heterogeneous and noisy training sets. Moreover, it was possible to identify problematic subsets that had a negative influence on overall performance. These results could reduce the effort needed to train predictive models from heterogeneous data sets.
ESR3: The Generative Topographic Mapping (GTM) approach was used to compare three large databases: PDB17, ChEMBL17 (containing existing biologically active molecules) and FDB17 (containing chemical structures generated by computer) totalling 21M molecules. We identified new chemotypes in FDB17, which might be of interest to medicinal chemists in both academia and industry.
ESR4: The inhibition of firefly luciferase in biological assays can lead to false positive results, thus wasting time and resources. We performed a large-scale analysis of luciferase inhibition data using >300k measurements to identify such inhibitors. The resulting model is available online and can be used to avoid spurious results, thus speeding up the drug discovery process.
ESR5: The public ChEMBL (medicinal chemistry data) and PubChem (biological screening data) compound repositories have been analysed and searched for compounds with multi-target activities to infer drug targets on the basis of target annotations of structural analogues and to predict compounds with desirable multi-target activities. The analogue relationships captured by these data were generally insufficient for predicting compounds with desired multi-target activities. More complex machine learning prediction approaches will be further investigated with a particular focus on activity data from large pharma projects.
ESR6: The fellow contributed to and exploited virtual chemical databases to find new drug candidates. First, part of the ring-system chemical space was virtually enumerated, which gives chemists new tools for finding novel candidate structures to synthesize. Software for searching very large databases (>1 billion molecules) is currently under development.
ESR7: A library of manually curated chemical reactions was created. Figure ESR7_1 shows a network of all extracted chemical reactions, clustered by common reactant and product patterns. To date, 121 chemical reactions are ready to be used by an automated enumeration workflow that has been implemented (Enumeration Framework, Figure ESR7_2).
ESR8: The project focuses on generating new molecular structures based on QSAR (Quantitative Structure-Activity Relationship) models, i.e. proposing structures with desired activities rather than predicting activities from given structures (the inverse-QSAR problem). The autoencoder neural networks developed can encode SMILES strings into latent variables and, conversely, reconstruct SMILES strings from those latent variables. This kind of de novo molecular design method has the potential to fundamentally change the drug discovery process.
ESR9: The main research topic of ESR9 has been the development of a tool to predict polypharmacology based on big data. Different sets of protein pocket descriptors useful to predict polypharmacology have been calculated. The visualization of the different patches in the binding sit

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

ESR1: The machine learning analysis will be extended to compound data sets from in-house pharma sources, and further practical applications will be evaluated.
ESR2: The fellow will develop machine learning based iterative screening strategies for carrying out high throughput screening in a more efficient and resource saving manner.
ESR3: Future work will include the finalization of the virtual screening project as well as an in-depth analysis of GTM landscapes for compound collection comparisons.
ESR4: The fellows will work to develop new filters for the identification of frequent hitters using public and in-house data of partners.
ESR5: The analysis will be extended to biological screening data from AstraZeneca and a search will be carried out for frequent hitters to better understand the molecular basis of their apparent promiscuity.
ESR6: The tools and databases developed will support continued research in the Big Data era. They will make it easier to find new compounds for synthesis and new drug candidates, and will give a better general understanding of chemical space.
ESR7: As a prospective application, compounds with a desired profile will be identified, taking the example of mimetics of selected natural products with known target activities. De novo design studies will be enabled that can be tailored towards novel regions in chemical space while taking compound synthesizability into account.
ESR8: Future work will continue the development of a de novo molecular design method that combines recurrent neural networks and reinforcement learning, to be considered for application in AZ internal drug discovery projects.
ESR9: The focus will be on developing deep learning methodologies for retrosynthesis planning. It is expected that the methodologies will be used to design synthetic routes for compounds in AZ internal drug discovery projects and for compounds in the GDB database.
ESR10: The work will focus on analysing methods for secure data sharing, in particular promising new approaches such as deep neural networks.
