Periodic Reporting for period 2 - AISym4MED (Synthetic and scalable data platform for medical empowered AI)
Reporting period: 2024-06-01 to 2025-11-30
This platform will address data privacy and security by combining new anonymization techniques, attribute-based privacy measures and trustworthy tracking systems. Data quality evaluation measures and model inspection methods will be available to identify shortcomings in the different stages of the AI models’ development pipeline, such as biased or unreliable datasets and models, and on-demand controlled data synthesis will be provided to address potential limitations. Real-world and synthetic data quality assessment and human-centred design for validation purposes will also be implemented to guarantee the representativeness of the platform’s datasets. Furthermore, this platform will exploit federated technologies to allow the secure usage of private data from closed borders, promoting indirect access to a broader number of databases, while respecting the privacy, security and General Data Protection Regulation requirements.
The proposed platform will support the development of new robust artificial intelligence-based solutions for health and streamline their integration in clinical scenarios. By leveraging distributed tools, digital technologies and state-of-the-art AI approaches, it will benefit researchers, innovators, patients and providers of health services, while maintaining a high level of data privacy and its ethical usage.
This platform will be validated against local, national and cross-border use cases targeting different types of stakeholders (data scientists and engineers, artificial intelligence software developers, researchers, clinical professionals, among others), in order to assess its different functionalities and its usability in real-world settings.
Data auditing is anchored in the pyMDMA (Multimodal Data Metrics for Auditing) library, which provides a standardised framework for validating both real and synthetic medical images, time series, and tabular data. Regarding model auditing, new modules for explainability, fairness and bias mitigation, uncertainty estimation, and privacy have been implemented to quantify and correct model performance across protected demographic groups, assess confidence in predictions, and evaluate privacy risks, ensuring the development of trustworthy AI. The GASTeN framework for stress-testing models against edge-case data was conceptualised.
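As an illustration of what quantifying model performance across protected demographic groups can involve, the following is a minimal sketch in plain Python. It is not the project's fairness module and does not use the pyMDMA API; the function names `group_rates` and `max_gap`, the toy labels and the group names "A" and "B" are all assumptions made for this example.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return accuracy and positive-prediction rate for each group."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "positive": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats[group]
        s["n"] += 1
        s["correct"] += int(truth == pred)
        s["positive"] += int(pred == 1)
    return {
        group: {
            "accuracy": s["correct"] / s["n"],
            "positive_rate": s["positive"] / s["n"],
        }
        for group, s in stats.items()
    }


def max_gap(rates, key):
    """Largest difference in a given rate between any two groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


# Hypothetical toy data: binary labels, binary predictions, one protected attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, groups)
accuracy_gap = max_gap(rates, "accuracy")        # performance disparity
parity_gap = max_gap(rates, "positive_rate")     # demographic parity difference
```

A gap close to zero suggests the model treats the groups similarly on that measure; a bias mitigation step would aim to reduce the gap without degrading overall performance.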
Regarding synthetic medical data generation, the project has moved beyond preliminary testing to deliver high-fidelity generative models across multiple modalities and targeting the project’s use cases. Key breakthroughs include a controllable model for retinal fundus images and specific generative models for ECG, EEG, and clinical tabular data. A quantitative approach to evaluating synthetic data was proposed, unifying the dimensions of fidelity, diversity, privacy, and utility. Specific additional metrics to evaluate synthetic clinical time series were conceptualised. To bridge the gap between quantitative metrics and clinical trust, the project launched the "Doctor-in-the-Loop" evaluation workflow.
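To give a concrete sense of two of the evaluation dimensions named above, the sketch below computes one simple privacy indicator (distance from each synthetic record to its closest real record) and one simple fidelity indicator (per-feature difference in means) on tabular data. These particular measures are assumptions chosen for illustration, not the project's proposed approach; the function names and the toy records are hypothetical.

```python
import math


def distance_to_closest_record(synthetic, real):
    """Privacy indicator: Euclidean distance from each synthetic row
    to its nearest real row. A distance of 0 means an exact copy."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


def mean_abs_difference(synthetic, real):
    """Fidelity indicator: absolute difference between the real and
    synthetic mean of each feature."""
    n_features = len(real[0])
    diffs = []
    for j in range(n_features):
        real_mean = sum(r[j] for r in real) / len(real)
        synthetic_mean = sum(s[j] for s in synthetic) / len(synthetic)
        diffs.append(abs(real_mean - synthetic_mean))
    return diffs


# Hypothetical toy records with two numeric features each.
real = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
synthetic = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0)]

dcr = distance_to_closest_record(synthetic, real)
fidelity = mean_abs_difference(synthetic, real)
leaked = [s for s, d in zip(synthetic, dcr) if d == 0.0]  # exact copies of real rows
```

In this toy case the feature means match exactly, yet one synthetic record duplicates a real one, which illustrates why fidelity and privacy need to be assessed together rather than in isolation.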
To ensure trustworthiness and data privacy, the project evolved its framework into a functional, multi-layered security architecture, built on a robust legal foundation for cross-border processing.
An iterative validation of functionalities was set up. Early pairing of technical partners with use-case owners yielded the essential "building blocks" for validation, including data dictionaries, feature extraction methods, and predictive and generative models tailored to real-world clinical scenarios.
The project has explored high-fidelity generative models using Stable Diffusion and pathology-controlled generation, and proposed specific adaptations for small real datasets, which show preliminary promise in improving disease classification by mitigating real-world data scarcity. Further advancements, such as GASTeN and the "Doctor-in-the-Loop" workflow, are intended to provide evidence on the reliability and clinical applicability of machine learning models, with the goal of supporting their suitability for real-world scenarios as validation continues. These technical advancements are complemented by a multi-layered governance architecture, designed to ensure that data handling remains strictly GDPR-compliant.
Future efforts will focus on demonstrating the consistency of these features in broader operational environments and refining exploitation strategies to ensure long-term scalability and market alignment. To ensure further uptake and success, continued research into federated learning and model auditing will be crucial for adapting to evolving data landscapes and regulatory requirements. Demonstrating the effectiveness of synthetic data in real-world applications will help validate its benefits and encourage broader adoption. Engaging with industry partners and conducting pilot projects will also be essential for aligning the technology with practical needs and establishing its value in various sectors.