Synthetic Data Generation Framework for Integrated Validation of Use Cases and AI Healthcare Applications

Project Information

SYNTHIA

Grant agreement ID: 101172872

DOI

10.3030/101172872

EC signature date 9 August 2024

Start date 1 September 2024

End date 31 August 2029

Funded under

Health

Total cost

€ 22 415 540,00

EU contribution

€ 12 438 775,00

9 976 765,00

Coordinated by

FUNDACION PARA LA INVESTIGACION DEL HOSPITAL UNIVERSITARIO LA FE DE LA COMUNIDAD VALENCIANA
Spain

Periodic Reporting for period 1 - SYNTHIA (Synthetic Data Generation Framework for Integrated Validation of Use Cases and AI Healthcare Applications)

Reporting period: 2024-09-01 to 2025-08-31

The SYNTHIA project addresses a critical challenge in modern healthcare: the difficulty of accessing high-quality patient data due to incompleteness, stringent privacy laws and complex regulations. This scarcity of real-world data severely limits the training of robust Artificial Intelligence models, slowing medical research and, in turn, the development of personalised treatments.

SYNTHIA's overall objective is to build trust in, and accelerate the adoption of, Synthetic Data: artificially generated data that statistically mimics real patient records without compromising individual privacy. The project is developing and rigorously validating cutting-edge Synthetic Data Generation tools and methods for diverse medical data types, including imaging data, genomic data, and clinical notes.
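As a rough illustration of what "statistically mimics" can mean, the sketch below fits a mean and standard deviation to one numeric column and draws new values from the fitted distribution. It is a toy under stated assumptions, not the project's method: the function names and the sample ages are hypothetical, a single Gaussian ignores correlations between variables, and nothing here provides a privacy guarantee.

```python
import random
import statistics
from typing import List, Tuple


def fit_gaussian(values: List[float]) -> Tuple[float, float]:
    """Summarise a numeric column by its sample mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(values: List[float], n: int, seed: int = 0) -> List[float]:
    """Draw n synthetic values from a Gaussian fitted to the real column.

    Only the two fitted parameters are used for sampling; no real value
    is copied into the output.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mu, sigma = fit_gaussian(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical toy column (made-up ages, not project data).
toy_ages = [54.0, 61.0, 47.0, 70.0, 66.0, 58.0, 73.0, 49.0]
synthetic_ages = sample_synthetic(toy_ages, n=5000)
```

Comparing `statistics.mean(synthetic_ages)` with `statistics.mean(toy_ages)` shows the kind of statistical agreement an evaluation would check, although real evaluations also look at joint distributions, downstream utility and disclosure risk.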

Our focus is on six high-impact diseases: Lung Cancer, Breast Cancer, Multiple Myeloma, Diffuse Large B-cell Lymphoma, Alzheimer’s Disease, and Type 2 Diabetes Mellitus. Through dedicated technical use cases across the disease areas, SYNTHIA will provide the scientific evidence needed to show that results derived from synthetic data are as reliable as those from real-world data, ensuring an appropriate balance between data utility and patient privacy.

The project’s pathway to impact involves creating a comprehensive synthetic Data Evaluation Framework and a sustainable, federated synthetic data publishing platform. This platform will offer researchers certified, fit-for-purpose synthetic datasets and validated synthetic data generation tools. By establishing this widely accessible resource, SYNTHIA will significantly accelerate the development of Artificial Intelligence-based diagnostic and prognostic tools, enable faster clinical trials (for example, using synthetic control arms), and contribute substantially to the emerging European Health Data Space. This initiative is positioned to leverage the expected massive growth in the synthetic data market, strengthening Europe’s leadership in data-driven personalised medicine.

During the first year of the SYNTHIA project, all essential foundational and technical infrastructures were successfully established. The clinical data and application requirements for the six use cases were comprehensively defined.

In parallel, the ethical and legal framework was consolidated through the completion of the Data Protection Impact Assessment, ensuring that General Data Protection Regulation-compliant data processing protocols were in place from the start.

At the same time, the core federated infrastructure was deployed and connected to three data provider nodes. The platform is now operational, enabling secure, distributed model training and synthetic data generation without compromising the privacy of real data.
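The idea of distributed training without moving real data can be sketched with federated averaging, a common federated learning scheme. The source does not say which algorithm the SYNTHIA platform uses, so this is an assumption chosen for illustration; the function names, the one-dimensional linear model and the hyperparameters are all hypothetical. Each node trains on its own records and returns only model parameters and a record count, which the coordinator combines.

```python
from typing import List, Tuple

Weights = Tuple[float, float]   # (slope, intercept) of y = w * x + b
Record = Tuple[float, float]    # (x, y) pair held privately by one node


def local_update(weights: Weights, data: List[Record],
                 lr: float = 0.1, epochs: int = 20) -> Tuple[Weights, int]:
    """Run gradient descent on one node's own data (mean squared error).

    Returns the updated parameters and the number of local records;
    the records themselves never leave this function.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Combine node parameters, weighting each node by its record count."""
    total = sum(n for _, n in updates)
    w = sum(wt[0] * n for wt, n in updates) / total
    b = sum(wt[1] * n for wt, n in updates) / total
    return (w, b)


def train(nodes: List[List[Record]], rounds: int = 50) -> Weights:
    """Alternate local training and central averaging for a number of rounds."""
    global_weights: Weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in nodes]
        global_weights = federated_average(updates)
    return global_weights
```

In this sketch the coordinator only ever sees `(w, b)` and a count from each node. A production deployment would typically add further safeguards, such as secure aggregation or differential privacy, which are outside the scope of this example.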

To guarantee high data quality and interoperability, data standardisation protocols based on the Observational Medical Outcomes Partnership Common Data Model were implemented. In addition, an extensive state-of-the-art review of synthetic data generation methods was completed. The insights gained from this review informed the definition of detailed technical specifications and quality metrics for selecting and adapting advanced Synthetic Data Generation models capable of handling the complex, multimodal, and longitudinal data required for clinical validation across all six use cases.

During its first year, SYNTHIA advanced well beyond the original State of the Art by successfully overcoming several major logistical, technical and legal bottlenecks. Key achievements include:

Operationalisation of the Federated Learning Infrastructure and establishment of a robust ethical and legal framework: The simultaneous achievement of technical deployment and legal clearance represents a critical step forward compared to the State of the Art, where obtaining General Data Protection Regulation-compliant, multi-centre access often delays projects for years. This milestone validates the feasibility of securely using high-value real data for Synthetic Data Generation, effectively transitioning the platform from a theoretical framework (Technology Readiness Levels 3-4) to an operational environment.

Development of clinically driven master protocols for Synthetic Data Generation: The comprehensive definition of clinical, data, and application requirements across all six use cases (WP 3) has produced an application-driven blueprint ensuring that the Synthetic Data Generation tools under development are both fit-for-purpose and clinically relevant. This methodological rigour directly addresses a key limitation of current State of the Art approaches: the absence of standardised frameworks to guarantee that generated synthetic data preserves application-specific utility and clinical validity.

Foundation for large-scale data standardisation and interoperability: The implementation of Observational Medical Outcomes Partnership Common Data Model-based data standardisation protocols across the first federated nodes (WP 4) proactively resolves major interoperability challenges that typically hinder European health data initiatives.

Generation and validation of first synthetic data models: The first synthetic data models for breast cancer, multiple myeloma, and Alzheimer’s disease were successfully developed and validated, marking a major technical milestone. In parallel, the technology team established and tested synthetic data generation methods tailored to key conditions such as multiple myeloma and type 2 diabetes mellitus.
