Periodic Reporting for period 1 - SYNTHIA (Synthetic Data Generation Framework for Integrated Validation of Use Cases and AI Healthcare Applications)
Reporting period: 2024-09-01 to 2025-08-31
SYNTHIA's overall objective is to build trust and accelerate the adoption of Synthetic Data—artificially generated data that statistically mimics real patient records without compromising individual privacy. The project is developing and rigorously validating cutting-edge Synthetic Data Generation tools and methods for diverse medical data types, including imaging, genomic, and clinical notes.
Our focus is on six high-impact diseases: Lung Cancer, Breast Cancer, Multiple Myeloma, Diffuse Large B-cell Lymphoma, Alzheimer’s Disease, and Type 2 Diabetes Mellitus. Through dedicated technical use cases across these disease areas, SYNTHIA will provide the scientific evidence needed to show that results derived from synthetic data are as reliable as those from real-world data, ensuring a proper balance between data utility and patient privacy.
The project’s pathway to impact involves creating a comprehensive synthetic Data Evaluation Framework and a sustainable, federated synthetic data publishing platform. This platform will offer researchers certified, fit-for-purpose synthetic datasets and validated synthetic data generation tools. By establishing this widely accessible resource, SYNTHIA will significantly accelerate the development of Artificial Intelligence-based diagnostic and prognostic tools, enable faster clinical trials (for example, using synthetic control arms), and contribute substantially to the emerging European Health Data Space. This initiative is positioned to leverage the expected massive growth in the synthetic data market, strengthening Europe’s leadership in data-driven personalised medicine.
During the first reporting period, the ethical and legal framework was consolidated through the completion of the Data Protection Impact Assessment, ensuring that General Data Protection Regulation-compliant data processing protocols were in place from the start.
At the same time, the core federated infrastructure was deployed and connected to three data provider nodes. The platform is now operational, enabling secure, distributed model training and synthetic data generation without compromising the privacy of real data.
To guarantee high data quality and interoperability, data standardisation protocols based on the Observational Medical Outcomes Partnership Common Data Model were implemented. In addition, an extensive state-of-the-art review of synthetic data generation methods was completed. The insights gained from this review informed the definition of detailed technical specifications and quality metrics for selecting and adapting advanced Synthetic Data Generation models capable of handling the complex, multimodal, and longitudinal data required for clinical validation across all six use cases.
Operationalisation of the Federated Learning Infrastructure and establishment of a robust ethical and legal framework: The simultaneous achievement of technical deployment and legal clearance represents a critical step forward compared to the state of the art, where obtaining General Data Protection Regulation-compliant, multi-centre data access often delays projects for years. This milestone validates the feasibility of securely using high-value real data for Synthetic Data Generation, effectively transitioning the platform from a theoretical framework (Technology Readiness Levels 3-4) to an operational environment.
Development of clinically driven master protocols for Synthetic Data Generation: The comprehensive definition of clinical, data, and application requirements across all six use cases (WP 3) has produced an application-driven blueprint ensuring that the Synthetic Data Generation tools under development are both fit-for-purpose and clinically relevant. This methodological rigour directly addresses a key limitation of current state-of-the-art approaches: the absence of standardised frameworks to guarantee that generated synthetic data preserves application-specific utility and clinical validity.
Foundation for large-scale data standardisation and interoperability: The implementation of data standardisation protocols based on the Observational Medical Outcomes Partnership Common Data Model across the first federated nodes (WP 4) proactively resolves major interoperability challenges that typically hinder European health data initiatives.
Generation and validation of first synthetic data models: The first synthetic data models for breast cancer, multiple myeloma, and Alzheimer’s disease were developed and validated, marking a major technical milestone. In parallel, the technology team established and tested synthetic data generation methods tailored to key conditions such as multiple myeloma and type 2 diabetes mellitus.