Synthetic generation of hematological data over federated computing frameworks

Project information

SYNTHEMA

Grant agreement ID: 101095530

DOI

10.3030/101095530

EC signature date: 10 November 2022

Start date: 1 December 2022

End date: 30 November 2026

Funded under

Health

Total cost

€ 6 514 560,00

EU contribution

€ 6 514 560,00

Coordinated by

UNIVERSIDAD POLITECNICA DE MADRID
Spain

Periodic Reporting for period 1 - SYNTHEMA (Synthetic generation of hematological data over federated computing frameworks)

Reporting period: 2022-12-01 to 2024-05-31

Haematological diseases (HDs) encompass around 450 disorders involving abnormalities in blood cells, lymphoid organs, and coagulation factors, divided into oncological (e.g. lymphomas, myelomas, leukaemias) and non-oncological categories (e.g. haemoglobinopathies, haemolytic anaemias, coagulopathies). Effective management of HDs demands interdisciplinary collaboration, but existing national and EU-level initiatives (e.g. European Leukaemia Net, ERN-EuroBloodNet) often fall short in providing comprehensive, evidence-based guidance, particularly for rare HDs, because data are scarce and fragmented. This lack of cohesive data hampers research, health planning, and the sustainability of patient registries.
SYNTHEMA aims to address these challenges by establishing a cross-border health data hub for rare HDs, leveraging innovative AI-based techniques for data anonymisation and synthetic data generation (SDG). This platform will utilize privacy-preserving federated learning (FL) networks with secure multi-party computation (SMPC) and differential privacy (DP), connecting various stakeholders to enhance translational and clinical research.
The project will contribute to existing data registries like the European Rare Blood Disorders Platform (ENROL) and the European Platform on Rare Disease Registration (EU RD Platform) by providing data standards and shareable assets. SYNTHEMA focuses on the following strategic objectives:
1. Develop methods for generating synthetic multimodal clinical, omics, and imaging data for HDs, focusing on sickle-cell disease (SCD) and acute myeloid leukaemia (AML), validated through advanced AI algorithms and tested for privacy risks.
2. Create de-identification, minimisation, and anonymisation pipelines, assessing their privacy levels to support clinical research and care, ensuring data utility while protecting privacy.
3. Enhance FL applications, SMPC, and DP solutions for privacy-preserving algorithm training and model aggregation, connecting health data centres and computing centres to train SDG algorithms.
4. Ensure ethical and GDPR compliance in data-driven research, developing frameworks and guidelines for ethical AI generation and data privacy, monitored by an Ethics Advisory Board (EAB).
5. Promote wide adoption and scalability of methodologies and tools through stakeholder engagement, dissemination, and open science practices, involving healthcare professionals, academia, industry, and patient communities.
By achieving these objectives, SYNTHEMA will advance the field of HD research, improving data availability and utility while ensuring ethical standards and privacy protections, thereby enhancing the overall management and understanding of rare haematological diseases.

The project is structured into seven work packages (WPs), each targeting specific objectives related to data collection, infrastructure development, data processing, validation, privacy, outreach, and project management. During months 1 to 18 (M1-M18), the project completed its requirement elicitation, preliminary assessment, and state-of-the-art review activities; the core development activities have started and are ongoing.
WP1 concentrated on the collection, harmonisation, and interoperability of data. This involved designing clinical use cases for sickle cell disease (SCD) and acute myeloid leukaemia (AML), collecting comprehensive datasets from 12 hospitals, and creating a common data model. The team established a data quality plan and developed metadata models mapped to the OMOP Common Data Model (CDM) to enhance data interoperability and facilitate future integrations with other health datasets.
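As an illustration of what mapping a local record to the OMOP CDM can look like, the sketch below converts one hospital-style record into a few fields of the OMOP PERSON table. The local field names (`patient_key`, `sex`, `birth_date`) and the function `to_omop_person` are hypothetical, and the report does not describe the project's actual mapping rules; only the target column names and the standard gender concept IDs come from the OMOP CDM itself.

```python
# Standard OMOP gender concept IDs (8507 = male, 8532 = female).
# Concept ID 0 is the OMOP convention for "no matching concept".
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


def to_omop_person(local_record):
    """Map a hypothetical local patient record to a subset of OMOP PERSON fields."""
    return {
        "person_id": local_record["patient_key"],
        "gender_concept_id": GENDER_CONCEPTS.get(local_record["sex"], 0),
        # Assumes birth_date is an ISO string such as "1990-04-12".
        "year_of_birth": int(local_record["birth_date"][:4]),
        # Keep the original value so the mapping stays auditable.
        "gender_source_value": local_record["sex"],
    }


# Made-up example record, not project data.
person = to_omop_person({"patient_key": 1, "sex": "F", "birth_date": "1990-04-12"})
```

Keeping the source value next to the mapped concept is a common choice because it lets data quality checks trace each standardised field back to what the hospital recorded.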
WP2 focused on developing a federated learning (FL) infrastructure, which included gathering requirements, designing, and deploying the FL platform. This platform supports secure multi-party computation (SMPC) and differential privacy (DP) techniques, ensuring that data analysis can be performed without compromising individual privacy. The team completed the technical and user requirement elicitation for the platform and an initial design of the platform architecture, while the development of the different components is ongoing.
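To make the combination of federated learning and differential privacy more concrete, the following minimal sketch averages model updates from several sites after clipping each update and adding Gaussian noise to the result. It is an assumption-laden illustration, not the SYNTHEMA platform: the function names, the example numbers, and the parameters `max_norm` and `noise_std` are hypothetical, and calibrating the noise to a formal privacy budget is not shown.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm or norm == 0.0:
        return list(update)
    scale = max_norm / norm
    return [x * scale for x in update]


def dp_federated_average(client_updates, max_norm, noise_std, rng):
    """Average clipped client updates, then add Gaussian noise to each coordinate."""
    clipped = [clip_update(u, max_norm) for u in client_updates]
    n_clients = len(clipped)
    dim = len(clipped[0])
    averaged = [sum(u[i] for u in clipped) / n_clients for i in range(dim)]
    return [value + rng.gauss(0.0, noise_std) for value in averaged]


# Hypothetical model updates from three hospitals (made-up numbers).
updates = [[0.2, -0.1], [3.0, 4.0], [-0.3, 0.5]]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
global_update = dp_federated_average(updates, max_norm=1.0, noise_std=0.01, rng=rng)
```

Only the updates leave each site in this scheme, never the patient records. Clipping bounds how much any single site can shift the average, which is what makes the added noise meaningful as a privacy mechanism.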
WP3 was dedicated to creating data anonymisation and synthetic data generation pipelines. The aim was to balance data privacy with utility, allowing synthetic datasets to be used in place of real ones without significant loss of information quality. This WP developed initial versions of these pipelines and assessed their effectiveness.
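The privacy-versus-utility trade-off mentioned above can be illustrated with k-anonymity, a common privacy measure chosen here only as an example; the report does not state which metrics the project's pipelines use. The sketch below measures k for a toy table before and after generalising exact ages into ten-year bands. The records, field names, and helper functions are all hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def generalise_age(record, bin_width=10):
    """Replace an exact age with a coarser age band, e.g. 34 becomes '30-39'."""
    low = (record["age"] // bin_width) * bin_width
    out = dict(record)
    out["age"] = f"{low}-{low + bin_width - 1}"
    return out


# Made-up records, not project data.
records = [
    {"age": 34, "sex": "F", "diagnosis": "SCD"},
    {"age": 37, "sex": "F", "diagnosis": "AML"},
    {"age": 52, "sex": "M", "diagnosis": "AML"},
    {"age": 58, "sex": "M", "diagnosis": "SCD"},
]
k_before = k_anonymity(records, ["age", "sex"])
k_after = k_anonymity([generalise_age(r) for r in records], ["age", "sex"])
```

In this toy table every patient is unique on exact age and sex, while age bands place each patient in a group of two. The cost is utility: analyses that need exact age can no longer be run on the generalised data.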
WP4 involved the clinical validation and statistical utility assessment of the generated data. This included defining clinical validation metrics and applying a structured synthetic validation framework (SVF) to ensure that the synthetic data maintained its relevance and accuracy for clinical use. The team elaborated a first version of the SVF that will be further refined in the coming months.
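One simple example of a statistical utility check is the two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between the empirical distributions of a real and a synthetic variable. It is shown here as a generic illustration; the report does not list the metrics included in the SVF, and the function name and sample values below are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    # The gap between two step functions can only change at observed values.
    points = sorted(set(real_sorted) | set(synth_sorted))
    max_gap = 0.0
    for x in points:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Made-up haemoglobin values (g/dL), not project data.
real_hb = [7.8, 8.4, 9.1, 9.6, 10.2]
synthetic_hb = [7.9, 8.6, 9.0, 9.8, 10.1]
gap = ks_statistic(real_hb, synthetic_hb)
```

A value near 0 means the synthetic variable follows the real distribution closely, and a value near 1 means the two barely overlap. A per-variable check like this says nothing about correlations between variables or about clinical plausibility, which need separate metrics.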
WP5 addressed privacy and security assessment. The team developed a privacy assessment framework to ensure compliance with applicable privacy legislation. This framework was essential in maintaining the integrity and confidentiality of the data throughout the project.
WP6 focused on outreach, exploitation, and collaboration. This involved engaging with stakeholder communities, disseminating findings, and ensuring the sustainability of the project's platform. The team conducted workshops, focus groups, and other outreach activities to foster collaboration and promote the project's outcomes.
WP7 managed the overall coordination and management of the project. This included overseeing scientific and operational activities, ensuring quality assurance, and mitigating risks. Additionally, WP7 defined and implemented an ethics management framework to guide ethical co-creation, monitoring, and assessment processes.

The SYNTHEMA project has made significant strides in its initial 18 months, contributing to global health data standards, creating a collaborative health data hub, and ensuring privacy and security in data processing.
It has contributed to global health data standards by collecting and harmonising extensive datasets for SCD and AML, which are now being standardised through the OMOP CDM. This effort supports broader healthcare research and clinical applications by enhancing data interoperability and quality. The development of the federated learning infrastructure and the associated privacy-preserving techniques ensures that sensitive health data can be analysed and utilised without compromising patient confidentiality.
Moreover, the creation of a health data hub and collaborative platform will facilitate data sharing and collaboration among clinical centres and public repositories. This platform supports various data-driven applications, including training anonymisation and synthetic data generation pipelines, which are crucial for advancing medical research while safeguarding patient privacy.
Future efforts will focus on refining data standards, enhancing the federated learning infrastructure, and further developing the anonymisation and synthetic data generation pipelines. Additionally, the project will continue its robust dissemination and collaboration efforts to maximize the impact of its findings and technologies.
