
Explainable AI for Molecules - AiChemist

Periodic Reporting for period 1 - AiChemist (Explainable AI for Molecules - AiChemist)

Reporting period: 2023-09-01 to 2025-08-31

Europe’s medicines and chemicals sectors face twin pressures: accelerating discovery while increasing safety, transparency and sustainability. Recent developments in AI promise dramatic gains, yet “black-box” predictions are difficult to trust, reproduce or regulate. AiChemist tackles this problem by making molecular and reaction modelling explainable from the start and by co-designing methods with industrial and regulatory stakeholders, so that results are scientifically robust, actionable for chemists, and acceptable to assessors. Today’s models often fail outside their training domain, struggle to generalise across data modalities (small molecules, proteins, reactions), and rarely communicate why a prediction should be believed. At the same time, experimental campaigns (e.g. reaction optimisation) are costly and carbon-intensive. AiChemist (https://aichemist.eu) addresses these gaps with open, benchmarked AI methods that couple representation learning with mechanistic and quantum-aware reasoning, and with a training programme that equips 14 doctoral candidates (DCs) to carry these practices into industry and academia.

The main objectives of AiChemist are:
1. Develop and benchmark explainable molecular, reaction and protein representations that improve accuracy, speed and applicability domain versus conventional physics-based/ML baselines.
2. Advance mechanistic and quantum-informed models (e.g. reaction-outcome predictors, QM-derived descriptors) to ground AI decisions in chemical theory.
3. Bridge AI outputs and chemical intuition through practical explainable AI (XAI) workflows for toxicity, drug response and reaction design—including uncertainty, multi-objective trade-offs, and human-interpretable rationales.
4. Validate on public and proprietary datasets, and release open, privacy-aware tools.
5. Train DCs through coordinated schools and secondments spanning academia and pharma, with the involvement of regulators in the supervisory board, ensuring durable uptake and technology transfer.

By improving trust, portability and efficiency of AI across discovery pipelines, AiChemist aims to reduce experimental iterations and compute budgets; enable safer medicines and chemicals via interpretable toxicity predictions; protect proprietary data while encouraging model exchange; and cultivate a new cohort of researcher-innovators fluent in XAI, open science and responsible research. The expected gains—faster, cheaper and greener design with explanations that chemists and regulators can use—position AiChemist to contribute to Europe’s strategic goals for innovation, safety and sustainability.
In WP1 we benchmarked privacy risks in published models using LiRA and RMIA membership-inference attacks and found that graph-based encodings with message-passing neural networks (MPNNs) balance accuracy and privacy. We also introduced MolEncoder and showed that the choice of masking ratio, not just scale, drives embedding quality and compute efficiency. We proposed a metric for in-context computation density (Multiple Token Divergence) and have prepared an autoregressive mass-spectrometry model for molecular elucidation. We further established trustworthiness benchmarks for explainers on protein language models and, for small molecules, compared SMILES-based XAI methods (Integrated Gradients, SHAP and DeepLIFT versus Occlusion and Grad-CAM) to identify consistent, chemically meaningful signals for downstream chemotype design.
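To make the LiRA privacy audit mentioned above more concrete, the following is a minimal sketch of the per-example likelihood-ratio test at the core of that attack. It is not the project's code. It assumes that confidences for one candidate molecule have already been collected from shadow models trained with it ("in") and without it ("out"); the function names and the example numbers in the usage note are hypothetical.

```python
import math
from statistics import mean, stdev


def logit_scale(p, eps=1e-6):
    """Map a confidence in (0, 1) to log(p / (1 - p)), clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def gaussian_logpdf(x, mu, sigma):
    """Log-density of the normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def lira_log_ratio(target_conf, in_confs, out_confs, min_sigma=1e-3):
    """Log-likelihood ratio that an example was in the target model's training set.

    target_conf: confidence of the audited model on the example's true label.
    in_confs / out_confs: confidences of shadow models trained with / without
    the example (hypothetical inputs; at least two values each).
    A positive result favours "member", a negative one "non-member".
    """
    phi_in = [logit_scale(p) for p in in_confs]
    phi_out = [logit_scale(p) for p in out_confs]
    x = logit_scale(target_conf)
    # Fit one Gaussian per hypothesis; floor sigma to avoid division by zero.
    mu_in, sd_in = mean(phi_in), max(stdev(phi_in), min_sigma)
    mu_out, sd_out = mean(phi_out), max(stdev(phi_out), min_sigma)
    return gaussian_logpdf(x, mu_in, sd_in) - gaussian_logpdf(x, mu_out, sd_out)
```

As a usage illustration, calling `lira_log_ratio(0.97, [0.95, 0.97, 0.99, 0.96], [0.55, 0.60, 0.70, 0.65])` returns a positive value, meaning the observed confidence looks more like that of a model that saw the molecule during training. A representation for which such ratios stay close to zero across molecules would leak less membership information.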
In WP2, an automated meta-MD workflow rediscovered the full catalytic cycle of a Buchwald–Hartwig coupling, an important first for complex organometallic mechanisms, and is now being extended to challenging substrates. We curated large condition datasets (~120k amide, ~50k Buchwald–Hartwig, ~20k Suzuki) and learned “condition fingerprints” that cluster settings yielding similar outcomes; combined with CGR reaction features, these embeddings improve feasibility modelling and enable virtual condition screening for practical recommendation. We also delivered multifidelity workflows that automate SMILES→DFT descriptors and a pharmacophore-representation calculator to support low-data selection.
In WP3, we produced a deployable multi-target nano-QSAR (NanoToxRadar) and initiated a deep model of gut-microbiome drug metabolism using PLM-derived bacterial embeddings. We advanced explainable toxicity modelling by benchmarking five XAI methods on SMILES encoders to establish faithful, consistent attributions, and then used the most reliable signals, augmented with global physicochemical/QM descriptors (e.g. HOMO–LUMO gap, logP), to steer generation of new chemotypes for mutagenicity and cardiotoxicity. We have also built subpopulation-aware cardiotoxicity models from curated FAERS datasets using rigorous nested cross-validation and design choices optimised against DICTrank, and introduced an interpretable multi-instance learning workflow (“MILK”) that quantifies conformer importance for activity to support trustworthy explanations.
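As an illustration of what a token-level attribution on a SMILES encoder looks like, here is a minimal sketch of occlusion, one of the XAI methods named in this report. It is not the project's implementation: the regex tokenizer is a simplification (it does not handle two-digit ring closures, for example), and `toy_nitro_score` is a hypothetical stand-in for a trained toxicity model.

```python
import re

# Simplified SMILES tokenizer (assumption): bracket atoms first,
# then the two-letter halogens, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)


def occlusion_attributions(smiles, score_fn, mask_token="*"):
    """Score drop caused by replacing each token, one at a time, with a mask.

    score_fn takes a list of tokens and returns a float (any model wrapper).
    Returns a list of (token, attribution) pairs in token order; a large
    positive value means the prediction relied on that token.
    """
    tokens = tokenize(smiles)
    baseline = score_fn(tokens)
    attributions = []
    for i, token in enumerate(tokens):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append((token, baseline - score_fn(occluded)))
    return attributions


def toy_nitro_score(tokens):
    """Hypothetical model stand-in: counts charged nitro-group tokens."""
    return float(sum(t in ("[N+]", "[O-]") for t in tokens))
```

Applied to nitrobenzene, `occlusion_attributions("c1ccccc1[N+](=O)[O-]", toy_nitro_score)` assigns non-zero attribution only to the `[N+]` and `[O-]` tokens. With a real model, comparing such attributions across several explainers is one way to check the faithfulness and consistency that the benchmark targets.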
Furthermore, AiChemist has co-led crowdsourcing challenges, namely the Tox24 and 2nd Joint EU-Openscreen/SLAS Challenges. These accelerate AI4Science research by uniting diverse expertise to develop, benchmark and validate models on real-world scientific data, fostering innovation and reproducibility.
1. We quantified privacy risks in published molecular ML models, showing that some representations leak training data while graph+MPNN models strike a better accuracy–privacy balance, informing safer model sharing and IP protection.
2. MolEncoder demonstrates that higher masking ratios, rather than brute-force scale, drive better embeddings and that performance gains saturate beyond moderate model/data sizes, pointing to lower compute and carbon costs without accuracy loss.
3. We achieved the first automatic re-discovery of the full catalytic cycle of a Buchwald–Hartwig coupling via meta-MD, a step-change for data-driven mechanism elucidation (TRL4).
4. New reaction-condition embeddings (“fingerprints”) cluster settings that yield similar outcomes across large amide/Buchwald–Hartwig/Suzuki datasets and enable virtual condition screening; combined with CGR reaction features, they improve feasibility modelling for practical recommendation (TRL3).
We also delivered multifidelity workflows automating SMILES→DFT descriptors and a pharmacophore-representation calculator to support robust selection under sparse data (TRL2–4).
In analytics and safety, a transformer for mass-spectrometry achieves state-of-the-art molecular elucidation on two benchmarks, accelerating structural ID, while an explainable multi-target nano-QSAR (NanoToxRadar; TRL5) and image-based toxicity/MoA studies strengthen decision-making earlier in the pipeline.
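The masking ratio discussed for MolEncoder is the fraction of input tokens hidden from the model during masked-language-model pretraining. The sketch below shows the idea in its simplest form; it is not MolEncoder's actual procedure, which this summary does not describe. The function name, the `[MASK]` symbol and the rounding rule are assumptions made for illustration.

```python
import random


def mask_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Hide a fraction of tokens for masked-language-model training.

    tokens: list of string tokens (e.g. from a SMILES tokenizer).
    mask_ratio: fraction of positions to mask, between 0 and 1.
    rng: a random.Random instance, passed in for reproducibility.
    Returns (masked, labels): labels hold the original token at masked
    positions and None elsewhere, so the loss covers masked positions only.
    """
    if not tokens:
        return [], []
    # Mask at least one position and never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(mask_ratio * len(tokens))))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels
```

For example, `mask_tokens(list("CCOCCNCCCC"), 0.3, random.Random(0))` hides three of the ten tokens. Raising `mask_ratio` makes each training example harder and yields more prediction targets per sequence, which is one plausible reason a higher ratio could improve embeddings without a larger model.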
Overview of Work on Developing Hybrid Chemotypes for Toxicity Prediction
Reaction Network Created using Meta-MD Approach
Explainability in Protein Language Models
Comparison of MolEncoder Performance to Existing Models
State-of-the-art Language Model based on a Transformer Architecture for MS elucidation
Overview of the NanoToxRadar Platform
Multi-Instance Learning Pipeline