Periodic Reporting for period 1 - PHASE IV AI (Privacy compliant health data as a service for AI development)
Reporting period: 2023-10-01 to 2025-03-31
Artificial intelligence (AI) enables data-driven innovations in health care. AI systems, which process vast amounts of data quickly and in detail, show promise as a tool for both preventive health care and clinical decision-making. However, the distributed storage of health data and limited access to it form a barrier to innovation, as developing trustworthy AI systems requires large datasets for training and validation. Furthermore, the availability of anonymous datasets would increase the adoption of AI-powered tools by supporting health technology assessments and education.
Secure, privacy-compliant data utilization is key to unlocking the full potential of AI and data analytics. Companies developing AI solutions would benefit from synthetic microdata for early-stage development, provided on demand and with privacy guarantees. For researchers and clinicians interested in aggregate data or modelling, multi-party computation allows insights to be derived from distributed real-world data without pooling the underlying records. In this way, providing synthetic data and multi-party computation as a service will boost data-driven innovation without compromising the privacy of data subjects.
PHASE IV AI will advance the current state-of-the-art data synthesis methods by developing a more generalized approach to synthetic data generation, while also creating robust metrics for testing and validation, aiming to
• Improve methods and technical pipelines for privacy-preserving data synthesis including different data formats such as Electronic Health Records (EHRs) and medical images.
• Provide easy-to-use, configurable data services that give AI developers access to larger pools of decentralized, de-identified data through multi-party computing.
• Provide anonymous data on demand or from a (temporary) repository.
• Establish a Data Market that facilitates data sharing and monetization, including incentive-based provision of data to the services.
• Integrate the data market and the data service ecosystem as a cross-European health data hub in the European Health Data Space, and
• Validate the results with real-world use-cases focusing on high impact diseases, cancer types in particular.
Specifically, the developments of the PHASE IV AI project will be validated in three real-life use cases covering high-impact diseases: (i) lung cancer, (ii) prostate cancer, and (iii) ischemic stroke. All three diseases are key topics in the European health ecosystem. Lung cancer and prostate cancer are among the top three priorities in tackling cancer, and neurodegenerative diseases are one of the most relevant issues for the EU's ageing population.
• Strong progress was achieved in defining user, data, legal, ethical, technical, and architectural requirements to support the PHASE IV AI project.
• The project developed and validated AI pipelines for generating synthetic health data. These tools are tailored to the three use cases (lung cancer, prostate cancer, and ischemic stroke) and incorporate privacy-preserving techniques such as differential privacy and diffusion models to meet General Data Protection Regulation (GDPR) requirements.
• A structured data collection process was set up and key datasets requiring harmonization were identified early in the process. The project harmonized real-world datasets using the OMOP (Observational Medical Outcomes Partnership) Common Data Model, facilitating interoperability across countries and institutions. Legal and ethical frameworks were established to ensure compliance with GDPR and the EU AI Act.
• The development of synthetic data services under PHASE IV AI has focused on several complementary objectives: generating new data to augment cohort sizes, creating de-identified datasets for privacy-preserving analytics, imputing missing data from observed values, and ultimately simulating disease progression through synthetic data modeling.
• The foundation for federated model training and validation in real-world healthcare scenarios was laid. Key achievements include the deployment and validation of a distributed secure multi-party computation (SMPC) network across eight partner organizations, the development of preliminary federated machine learning workflows, and the initial preparation of hardware acceleration strategies.
• A prototype of the Health Data Hub was designed, integrating services for anonymization, harmonization, and synthetic data generation. The project also explored decentralized physical infrastructure networks (DePIN) and blockchain-based certification to ensure trust and traceability.
• Two rounds of use-case workshops for stakeholders were held across four countries to gather input from clinicians, researchers, industry, and regulators. These insights were translated into user stories and usage scenarios that guide system development.
• The Pilot Plan includes study protocols covering the three use cases: lung cancer, prostate cancer, and ischemic stroke. These pilots will validate the utility of synthetic data and AI models in real clinical environments.
The privacy-preserving data synthesis methods developed so far include:
• Differential Privacy (DP)-compliant methods for tabular data (e.g. SDV-GaussianCopula, AIM, MWEM for EHR data).
• Latent diffusion models for CT imaging, which showcase how realistic synthetic datasets can maintain utility while ensuring privacy.
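To give a feel for the differential privacy guarantee that the tabular methods above build on, the following is a minimal sketch of the Laplace mechanism applied to a single count query. It is an illustration only and does not represent the project's pipelines or the internals of AIM or MWEM; the function name `dp_count`, the toy records, and the chosen epsilon are all assumptions made for this example.

```python
import random


def dp_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Hypothetical toy records; not project data.
patients = [
    {"age": 71, "diagnosis": "lung cancer"},
    {"age": 64, "diagnosis": "prostate cancer"},
    {"age": 58, "diagnosis": "lung cancer"},
    {"age": 80, "diagnosis": "ischemic stroke"},
]

noisy = dp_count(patients, lambda p: p["diagnosis"] == "lung cancer", epsilon=0.5)
print(f"Noisy lung cancer count: {noisy:.2f}")
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means less noise and better utility. Full synthetic data generators spend a privacy budget of this kind across many such measurements.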
To advance the conditions for the effective, cross-border utilization of real-world evidence through multi-party computation, the project has deployed an SMPC network across eight partner organizations, including both technical and clinical partners. A data market solution has been designed that allows the creation of a marketplace for both datasets and data services. It also provides the infrastructure for seamless and secure integration of data providers' assets while respecting stringent privacy-by-design constraints. The integration of DePIN and blockchain technologies into the Health Data Hub introduces a novel approach to secure, auditable, and value-based data exchange. The project's alignment with the EHDS and its participation in the EU Blockchain Sandbox position it as a reference model for future European health data initiatives.
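The idea behind computing aggregates over distributed data without revealing any site's input can be sketched with additive secret sharing. The protocol actually used in the project's SMPC network is not described in this summary, so the example below is a generic textbook construction under stated assumptions: honest-but-curious parties, a simulated network held in one process, and hypothetical names (`share`, `secure_sum`) and per-site counts.

```python
import secrets

# Assumed modulus for this sketch: a Mersenne prime well above any realistic sum.
MODULUS = 2**61 - 1


def share(value, n_parties):
    """Split `value` into n additive shares that sum to it modulo MODULUS."""
    if n_parties < 2:
        raise ValueError("need at least two parties")
    if not 0 <= value < MODULUS:
        raise ValueError("value out of range")
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares add up to the value.
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate a secure sum: each party only ever sees random-looking shares."""
    n = len(private_values)
    # distributed[i][j] is the share of party i's value sent to party j.
    distributed = [share(value, n) for value in private_values]
    # Each party j adds up the shares it received and publishes only that total.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Combining the published partial sums reveals the aggregate and nothing else.
    return sum(partial_sums) % MODULUS


# Hypothetical per-site case counts for eight sites; not project data.
site_counts = [12, 7, 30, 5, 18, 9, 22, 14]
print("Aggregate count:", secure_sum(site_counts))
```

Any individual share is uniformly random, so a single party learns nothing about another site's count from the shares it receives; only the final aggregate is disclosed. Production SMPC frameworks add authenticated channels, protection against malicious parties, and support for richer computations than a sum.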