Periodic Reporting for period 1 - AISym4MED (Synthetic and scalable data platform for medical empowered AI)
Reporting period: 2022-12-01 to 2024-05-31
This platform will address data privacy and security by combining new anonymization techniques, attribute-based privacy measures and trustworthy tracking systems. Data quality evaluation measures and model inspection methods will be available to identify shortcomings at the different stages of the AI model development pipeline, such as biased or unreliable datasets and models, and on-demand controlled data synthesis will be provided to address those limitations. Quality assessment of real-world and synthetic data, together with human-centred design for validation purposes, will also be implemented to guarantee the representativeness of the platform’s datasets. Furthermore, this platform will exploit federated technologies to allow the secure use of private data held behind closed borders, giving indirect access to a larger number of databases while respecting privacy, security and General Data Protection Regulation (GDPR) requirements.
The proposed platform will support the development of new robust artificial intelligence-based solutions for health and streamline their integration in clinical scenarios. By leveraging distributed tools, digital technologies and state-of-the-art AI approaches, it will benefit researchers, innovators, patients and providers of health services, while maintaining a high level of data privacy and ensuring that data is used ethically.
This platform will be validated against local, national and cross-border use cases targeting different types of stakeholders (data scientists and engineers, artificial intelligence software developers, researchers, clinical professionals, among others), to test its different functionalities and its usability in real-world settings.
A major focus was on designing an intuitive and user-centred interface for the platform. This process involved engaging with end-users and stakeholders to gather feedback and refine design concepts. Initial mock-ups were created and iteratively improved based on user input, resulting in a refined and functional front-end design ready for implementation. This approach ensured that the platform's user interface aligns with the needs and workflows of its target audience.
Backend development progressed significantly, with key achievements including the initial implementation of a federated learning system and backend modules for data management. These components support seamless data flow, machine learning model training, and evaluation. The development of these systems involved integrating various technical elements to handle complex data interactions and ensure the platform's efficiency and scalability.
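To make the federated learning idea concrete, the following is a minimal sketch of federated averaging (FedAvg), a widely used federated training scheme. It is an illustration only: the report does not state which algorithm the platform uses, and the function names (`local_update`, `federated_average`, `train`), the one-parameter linear model and the toy data are all assumptions made for this example. Each site trains on its own data and shares only a model weight, never the records themselves.

```python
from typing import List, Tuple

# Hypothetical type: (x, y) pairs held privately by one site.
Dataset = List[Tuple[float, float]]


def local_update(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps on y ~ w * x using only local data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(updates: List[Tuple[float, int]]) -> float:
    """Combine site weights, weighting each by its number of samples."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(sites: List[Dataset], rounds: int = 50) -> float:
    """Alternate local training and central averaging for a fixed number of rounds."""
    w = 0.0
    for _ in range(rounds):
        # Only (weight, sample count) pairs leave each site.
        updates = [(local_update(w, data), len(data)) for data in sites]
        w = federated_average(updates)
    return w


# Toy data (made up for illustration): both sites follow y = 2 * x.
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]
global_w = train(sites)
```

In this toy setting the shared weight approaches 2, the slope underlying both sites' data. A production system would add secure aggregation, authentication and handling of sites that drop out, none of which is shown here.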
The dataset generation and evaluation capabilities of the platform were enhanced, including the creation of tools for data synthesis and quality assessment. This work involved developing modules that support data ingestion, exploration and anonymization, as well as implementing techniques for generating synthetic data. These advancements help the platform handle diverse data types while maintaining high standards of data privacy and usability. Progress was also made on model inspection techniques for assessing predictive machine learning models: state-of-the-art techniques were implemented and new approaches for identifying model limitations were proposed. Together, these will form the basis of the platform’s model auditing functionality.
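As an illustration of the kind of check an anonymization module might run, the sketch below tests k-anonymity, a standard privacy measure in which every combination of quasi-identifier values must be shared by at least k records. This is an assumption chosen for illustration: the report does not say which privacy measures the platform implements, and the function names, column names and records here are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

Record = Dict[str, str]


def smallest_group(records: List[Record], quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def is_k_anonymous(records: List[Record], quasi_identifiers: Sequence[str], k: int) -> bool:
    """A table is k-anonymous if every quasi-identifier group has at least k records."""
    return smallest_group(records, quasi_identifiers) >= k


# Made-up records with already generalized attributes (age bands, coarse regions).
records = [
    {"age_band": "30-39", "region": "North", "diagnosis": "A"},
    {"age_band": "30-39", "region": "North", "diagnosis": "B"},
    {"age_band": "40-49", "region": "South", "diagnosis": "A"},
    {"age_band": "40-49", "region": "South", "diagnosis": "C"},
    {"age_band": "40-49", "region": "South", "diagnosis": "B"},
]
quasi_ids = ["age_band", "region"]
k2_ok = is_k_anonymous(records, quasi_ids, k=2)
k3_ok = is_k_anonymous(records, quasi_ids, k=3)
```

Here the smallest group has two records, so the table satisfies k = 2 but not k = 3. k-anonymity alone does not protect against every re-identification attack, which is one reason such checks are usually combined with other measures.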
Throughout the project, a strong emphasis was placed on ensuring the platform's compliance with legal, ethical, and security standards. Comprehensive risk assessments and security measures were integrated into the platform's development lifecycle, addressing potential vulnerabilities and ensuring robust protection of sensitive data. This holistic approach to security and compliance will be crucial for the successful deployment and operation of the platform in real-world scenarios.
Another major achievement is the synthetic data generation capability, which addresses the challenge of acquiring and utilizing real-world data for model training. The project has pioneered methods for generating high-quality synthetic datasets that effectively mimic real-world scenarios while safeguarding privacy, which can be directly used to mitigate AI model shortcomings arising from data-related limitations. The impact of these advancements extends beyond immediate technical improvements. The integration of model auditing and synthetic data generation enhances the overall reliability and applicability of machine learning models, generating evidence that can support their suitability for real-world scenarios and subsequently fostering greater trust and acceptance among users and stakeholders.
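The idea of generating synthetic data and then checking how closely it matches the real data can be sketched as follows. This is a deliberately simple stand-in, not the project's method: it fits a single Gaussian to one numeric column and compares summary statistics, whereas real generators model joint distributions across many attributes. All names (`fit_gaussian`, `synthesize`, `fidelity_gap`) and the sample values are hypothetical.

```python
import random
import statistics
from typing import Dict, List, Tuple


def fit_gaussian(values: List[float]) -> Tuple[float, float]:
    """Estimate the mean and sample standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def synthesize(values: List[float], n: int, seed: int = 0) -> List[float]:
    """Draw n synthetic values from a Gaussian fitted to the real column."""
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


def fidelity_gap(real: List[float], synthetic: List[float]) -> Dict[str, float]:
    """Compare basic statistics of the real and synthetic columns."""
    real_mu, real_sigma = fit_gaussian(real)
    syn_mu, syn_sigma = fit_gaussian(synthetic)
    return {
        "mean_gap": abs(real_mu - syn_mu),
        "stdev_gap": abs(real_sigma - syn_sigma),
    }


# Made-up measurements standing in for a real clinical column.
real_values = [118.0, 125.0, 131.0, 109.0, 122.0, 140.0, 115.0, 127.0]
synthetic_values = synthesize(real_values, n=5000)
gap = fidelity_gap(real_values, synthetic_values)
```

Small gaps indicate that the synthetic column reproduces the real column's location and spread. Matching marginal statistics is only a first check; a fuller assessment would also look at correlations between attributes, downstream model performance and privacy leakage.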
To ensure further uptake and success, several key needs must be addressed. Continued research into refining federated learning implementation and model auditing techniques will be crucial for adapting to evolving data landscapes and regulatory requirements. Demonstrating the effectiveness of synthetic data in real-world applications will help validate its benefits and encourage broader adoption. Engaging with industry partners and conducting pilot projects will also be essential for aligning the technology with practical needs and establishing its value in various sectors.