Periodic Reporting for period 1 - SYNTHEMA (Synthetic generation of hematological data over federated computing frameworks)
Reporting period: 2022-12-01 to 2024-05-31
SYNTHEMA aims to address these challenges by establishing a cross-border health data hub for rare haematological diseases (HDs), leveraging innovative AI-based techniques for data anonymisation and synthetic data generation (SDG). This platform will use privacy-preserving federated learning (FL) networks with secure multi-party computation (SMPC) and differential privacy (DP), connecting various stakeholders to enhance translational and clinical research.
The project will contribute to existing data registries like the European Rare Blood Disorders Platform (ENROL) and the European Platform on Rare Disease Registration (EU RD Platform) by providing data standards and shareable assets. SYNTHEMA focuses on six strategic objectives:
1. Develop methods for generating synthetic multimodal clinical, omics, and imaging data for HDs, focusing on sickle-cell disease (SCD) and acute myeloid leukaemia (AML), validated through advanced AI algorithms and tested for privacy risks.
2. Create de-identification, minimisation, and anonymisation pipelines, assessing their privacy levels to support clinical research and care, ensuring data utility while protecting privacy.
3. Enhance FL applications, SMPC, and DP solutions for privacy-preserving algorithm training and model aggregation, connecting health data centres and computing centres to train SDG algorithms.
4. Ensure ethical and GDPR compliance in data-driven research, developing frameworks and guidelines for ethical AI generation and data privacy, monitored by an Ethics Advisory Board (EAB).
5. Promote wide adoption and scalability of methodologies and tools through stakeholder engagement, dissemination, and open science practices, involving healthcare professionals, academia, industry, and patient communities.
By achieving these objectives, SYNTHEMA will advance the field of HD research, improving data availability and utility while ensuring ethical standards and privacy protections, thereby enhancing the overall management and understanding of rare haematological diseases.
WP1 concentrated on the collection, harmonisation, and interoperability of data. This involved designing clinical use cases for sickle cell disease (SCD) and acute myeloid leukaemia (AML), collecting comprehensive datasets from 12 hospitals, and creating a common data model. The team established a data quality plan and developed metadata models mapped to the OMOP Common Data Model (CDM) to enhance data interoperability and facilitate future integrations with other health datasets.
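As an illustration of what mapping to the OMOP CDM can involve, the sketch below converts one local hospital record into a subset of the OMOP PERSON table. The input field names (`local_id`, `sex`, `birth_date`) and the function `to_omop_person` are hypothetical and are not taken from the project; the gender concept IDs (8507 for male, 8532 for female) are the standard OMOP ones.

```python
from dataclasses import dataclass
from datetime import date

# Standard OMOP gender concepts; 0 is the OMOP convention for "no matching concept".
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


@dataclass
class OmopPerson:
    """Subset of the columns of the OMOP PERSON table."""
    person_id: int
    gender_concept_id: int
    year_of_birth: int
    person_source_value: str
    gender_source_value: str


def to_omop_person(record: dict, person_id: int) -> OmopPerson:
    """Map one hypothetical hospital record to an OMOP PERSON row."""
    sex = record["sex"].strip().upper()
    birth = date.fromisoformat(record["birth_date"])
    return OmopPerson(
        person_id=person_id,
        gender_concept_id=GENDER_CONCEPTS.get(sex, 0),
        year_of_birth=birth.year,
        # Source values keep the original codes so the mapping stays auditable.
        person_source_value=record["local_id"],
        gender_source_value=record["sex"],
    )


# Made-up example record, for illustration only.
person = to_omop_person(
    {"local_id": "H01-0042", "sex": "f", "birth_date": "1987-03-14"}, person_id=1
)
```

Keeping the source values next to the standard concept IDs is a common choice in this kind of mapping, because it lets a data quality check trace each harmonised row back to the originating record.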
WP2 focused on developing a federated learning (FL) infrastructure, which included gathering requirements, designing, and deploying the FL platform. This platform supports secure multi-party computation (SMPC) and differential privacy (DP) techniques, ensuring that data analysis can be performed without compromising individual privacy. The team completed the technical and user requirement elicitation for the platform and an initial design of the platform architecture, while the development of the different components is ongoing.
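To make the combination of FL, SMPC and DP more concrete, the following minimal sketch shows one aggregation round under simplifying assumptions. It is not the SYNTHEMA platform's implementation: the function names are hypothetical, the pairwise masking is a toy stand-in for a real SMPC protocol (which would use finite-field arithmetic and key agreement), and `noise_std` is not calibrated to any privacy budget.

```python
import random


def clip(update, max_norm):
    """Scale an update so its L2 norm is at most max_norm (bounds each site's influence)."""
    norm = sum(x * x for x in update) ** 0.5
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def mask_updates(updates, rng):
    """Add pairwise random masks that cancel in the sum across sites.

    Each masked update looks random on its own; only the total is meaningful.
    """
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            pair_mask = [rng.uniform(-1e3, 1e3) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] += pair_mask[k]  # site i adds the shared mask
                masked[j][k] -= pair_mask[k]  # site j subtracts the same mask
    return masked


def aggregate(masked, noise_std, rng):
    """Sum the masked updates, add Gaussian noise to the sum, and average."""
    n, dim = len(masked), len(masked[0])
    totals = [sum(m[k] for m in masked) for k in range(dim)]
    return [(t + rng.gauss(0.0, noise_std)) / n for t in totals]


rng = random.Random(0)
# Made-up model updates from three sites, clipped to L2 norm 1.0.
site_updates = [clip(u, 1.0) for u in [[0.2, -0.1], [0.4, 0.3], [3.0, 4.0]]]
masked = mask_updates(site_updates, rng)
average = aggregate(masked, noise_std=0.0, rng=rng)  # noise disabled to show that masks cancel
```

With `noise_std=0.0` the result equals the plain mean of the clipped updates up to floating-point error, which shows that the masks hide individual contributions without changing the aggregate. In a DP setting the noise scale would be chosen from the clipping norm and the target privacy parameters.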
WP3 was dedicated to creating data anonymisation and synthetic data generation pipelines. The aim was to balance data privacy with utility, allowing synthetic datasets to be used in place of real ones without significant loss of information quality. This WP developed initial versions of these pipelines and assessed their effectiveness.
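The balance between privacy and utility can be illustrated with two simple proxy metrics. The sketch below is an assumption for illustration, not the project's assessment method: it compares column means as a crude utility check and computes the distance to the closest real record as a crude privacy check. All values are made up.

```python
import math
import statistics


def mean_gap(real, synth, col):
    """Utility proxy: absolute difference between real and synthetic column means."""
    real_mean = statistics.fmean(row[col] for row in real)
    synth_mean = statistics.fmean(row[col] for row in synth)
    return abs(real_mean - synth_mean)


def distance_to_closest_record(real, synth):
    """Privacy proxy: Euclidean distance from each synthetic row to its nearest real row.

    A distance of 0 means the synthetic row copies a real record exactly.
    """
    return [min(math.dist(s, r) for r in real) for s in synth]


# Made-up rows of (haemoglobin in g/dL, age in years).
real_rows = [(12.1, 30.0), (9.8, 45.0), (11.0, 52.0)]
synth_rows = [(12.1, 30.0), (10.5, 48.0)]

gap = mean_gap(real_rows, synth_rows, col=0)
dcr = distance_to_closest_record(real_rows, synth_rows)
```

Here the first synthetic row has a distance of 0 because it duplicates a real record, which is the kind of leak a privacy assessment would flag even when the utility gap looks acceptable. In practice features would be scaled before computing distances, since age and haemoglobin are on different ranges.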
WP4 involved the clinical validation and statistical utility assessment of the generated data. This included defining clinical validation metrics and applying a structured synthetic validation framework (SVF) to ensure that the synthetic data maintained its relevance and accuracy for clinical use. The team developed a first version of the SVF, which will be further refined in the coming months.
WP5 addressed privacy and security assessment. The team developed a privacy assessment framework to ensure compliance with applicable privacy legislation. This framework was essential in maintaining the integrity and confidentiality of the data throughout the project.
WP6 focused on outreach, exploitation, and collaboration. This involved engaging with stakeholder communities, disseminating findings, and ensuring the sustainability of the project's platform. The team conducted workshops, focus groups, and other outreach activities to foster collaboration and promote the project's outcomes.
WP7 handled the overall coordination and management of the project. This included overseeing scientific and operational activities, ensuring quality assurance, and mitigating risks. Additionally, WP7 defined and implemented an ethics management framework to guide ethical co-creation, monitoring, and assessment processes.
The project has contributed to global health data standards by collecting and harmonising extensive datasets for SCD and AML, which are now being standardised through the OMOP CDM. This effort supports broader healthcare research and clinical applications by enhancing data interoperability and quality. The development of the federated learning infrastructure and the associated privacy-preserving techniques ensures that sensitive health data can be analysed and utilised without compromising patient confidentiality.
Moreover, the creation of a health data hub and collaborative platform will facilitate data sharing and collaboration among clinical centres and public repositories. This platform supports various data-driven applications, including training anonymisation and synthetic data generation pipelines, which are crucial for advancing medical research while safeguarding patient privacy.
Future efforts will focus on refining data standards, enhancing the federated learning infrastructure, and further developing the anonymisation and synthetic data generation pipelines. Additionally, the project will continue its dissemination and collaboration efforts to maximise the impact of its findings and technologies.