Big Data in Chemistry

Periodic Reporting for period 2 - BIGCHEM (Big Data in Chemistry)

Reporting period: 2018-01-01 to 2019-12-31

BIGCHEM was a Marie Skłodowska-Curie (MC) European Industrial Doctorate (EID) Innovative Training Network (ITN) dedicated to Big Data in Chemistry, the new interdisciplinary field at the interface of chemistry, informatics and life sciences. The main goal of BIGCHEM was to train a new generation of scientists equipped to face the challenges and exploit the opportunities of the field. The ultimate aim was to lead the transformation of the drug discovery field in order to accelerate the development of new therapies for human diseases. Both goals were successfully achieved.
The work in the BIGCHEM project was structured into ten interrelated research topics designed to develop new technologies for “Big Data” analysis across the preclinical phases of drug discovery, from the target validation phase through to the lead optimization and profiling phases.

Within the target discovery and validation phase of drug discovery, BIGCHEM exploited Big Data in machine-learning models for compound activity prediction (ESR1, ESR2), with a special emphasis on model interpretation (ESR1). BIGCHEM's work on new technologies for high-throughput virtual screening of chemical compounds (ESR2, ESR3, ESR4) made the screening process more efficient, specifically with respect to time and cost. This work also contributed to a deeper understanding of compound promiscuity, the property of chemical compounds to act on multiple targets, and produced tools to detect and filter out promiscuous compounds with unwanted activity (ESR9, ESR4, ESR5), enabling better interpretation of the results of screening assays.

Furthermore, the BIGCHEM fellows created innovative in silico methods for the visualization and analysis of large-scale compound datasets (ESR3, ESR4), facilitating the detection of new chemical leads from high-throughput screening campaigns.

The BIGCHEM fellows' scientific output also had a notable influence on the lead optimization and de novo design phase. The proposal of new generative models for the creation of compounds with desired properties (ESR6, ESR7, ESR8) and the development of innovative approaches for the planning of synthetic routes (ESR7, ESR9) were key to transforming and advancing this phase.

BIGCHEM also contributed to the creation of new data sharing methodologies, in particular methods to share ADMETox properties by means of graph convolutional deep neural networks and Matched Molecular Pairs (ESR10).

The above-mentioned BIGCHEM outcomes were presented at 65 scientific conferences and events and resulted in 52 publications. The project also co-organised the Strasbourg Summer School on Chemoinformatics in 2018 and the International Conference on Artificial Neural Networks (ICANN2019) in Munich in 2019, and edited a special issue on “Big data in Chemistry” in the Journal of Cheminformatics. The impact of BIGCHEM on the scientific community has been substantial: its publications were cited more than 600 times in 2018 alone according to Google Scholar and include one “Hot paper” as well as four “highly cited” articles (source: Web of Science). All ESRs were enrolled in the PhD programmes of their respective universities. Three fellows have already received their PhD degrees and the others are working towards theirs.
BIGCHEM's work made a significant impact on the drug discovery field by creating novel in silico approaches to accelerate the discovery of novel chemical leads, the starting point for the discovery of novel chemical therapies.

More specifically, BIGCHEM's contribution to improving the understanding of the decisions made by complex machine-learning models (ESR1) was of practical relevance in pharmaceutical research. The same was true of the newly developed strategies to combine data sets from heterogeneous sources, which expand the application scenarios for machine learning-based predictive modeling in drug discovery (ESR2). The virtual screening tools developed by BIGCHEM fellows (ESR3) were successfully used in various academic and industrial projects. In particular, the Hierarchical Generative Topographic Mapping (GTM) Zooming approach was applied to compare large chemical libraries and to search for unique chemotypes at Boehringer Ingelheim GmbH & Co KG.

The promiscuity filters developed to target specific assay technologies are of practical use in drug discovery, since they perform better than generic filters (ESR4). The tools and databases newly created by BIGCHEM fellows (ESR6) will help researchers to work with Big Data. Methodology developed within the project facilitated the identification of new compounds for synthesis and of new drug candidates and, more generally, allowed better understanding of and navigation through chemical space. The software tool created to predict the synthetic feasibility of reactants (ESR7) will help the drug discovery field to increase the synthetic accessibility of molecular designs, and will speed up and facilitate the identification of candidates for chemical synthesis in the wet lab. These methodologies (ESR9) have already been successfully validated for designing synthetic routes for compounds in internal drug discovery projects at AstraZeneca. The dissemination of the developed methodology in open access articles is allowing its widespread use by other interested partners, including academia, SMEs and large industry.

The new generative models for the creation of new compounds with desired properties (ESR6, ESR7, ESR8) developed by BIGCHEM will support future drug discovery projects by providing novel pharmaceutically relevant molecules with desired properties such as polypharmacology (ESR9) in a cost- and time-efficient manner.

The review articles by BIGCHEM partners aimed at non-specialist and non-expert audiences will contribute to public awareness of the impact and possibilities of this new field for improving human health.
Different similarity searching techniques give diverse results in a virtual screening analysis
Workflow used and the results generated during the course of the studies
Chemical data in pharma can be used in many studies of drug discovery
Compounds interfering or not with a technology are identified in counter-screen assays.
Schematic visualization of the method proposed for generation of molecules
Combining the bioactivity (orange) with the structural (blue) fingerprints improved predictions
Chemical Data Analysis using Generative Topographic Mapping (GTM)
A schema of how the scaffold generator model is able to complete scaffolds
Different computational methods used to flag potentially unwanted compounds in an HTS campaign.
Overview of drug-disease-target connections for polypharmacology
Schematic representation of Edge Memory Neural Network (EMNN)
Similarity maps of the GDB4c database colored by number of atoms (left) and by number of molecules per pixel (right).
Prioritization of chemical reactants that bear suitable and compatible reactive functional groups
Number of different sources contributing results for a given ChEMBL endpoint
Schematic visualization of the interpretation method proposed for compound activity predictions.