CORDIS - EU research results

AI powered Data Curation & Publishing Virtual Assistant

Periodic Reporting for period 1 - AIDAVA (AI powered Data Curation & Publishing Virtual Assistant)

Reporting period: 2022-09-01 to 2024-02-29

The AIDAVA project addresses a critical gap in healthcare data management: the seamless integration of high-quality personal health data (PHD) from multiple sources, making it interoperable, AI-ready, and reuse-ready across institutions on a national and EU scale. The core objective of AIDAVA is to develop and test an AI-powered virtual assistant that can automate data curation and publishing of both unstructured and structured, heterogeneous health data. This virtual assistant will empower patients and their delegates to manage their health data from various sources, including hospitals, general practitioners, patient-reported outcomes systems, and medical devices.
AIDAVA has made significant progress in developing its first-generation virtual assistant prototype, designed to aid patients regardless of their health and digital literacy.
Collaborating with medical experts, AIDAVA established two use cases: the Federated Breast Cancer Registry and Risk Monitoring for Myocardial Infarction Patients. These cases involved mapping out scenarios, selecting data sources, defining data flow compliance with privacy standards and identifying evaluation metrics.
The team finalised the requirements, design, and architecture of the prototype. Development, including AI-based curation tools, is underway. In parallel, a study protocol was developed to test the prototype across three hospital sites; approval from one ethical committee has already been secured. Feedback came from eight patients from two associations, who were involved in designing the prototype and defining the protocol and who expressed a keen interest in the success of the project.
To achieve the core objective of AIDAVA (automation in data curation and quality enhancement), the team identified 11 interoperability issues. For each issue, the team developed a workflow enabling automatic resolution by incorporating AI-based curation tools and human-in-the-loop dialogs. A data source onboarding tool was developed to map attributes from data sources to one of the workflows or directly to the target ontology. A Data Quality Framework was established and implemented on the curated data through Data Quality checks on consistency and completeness.
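As an illustration only, the onboarding step described above can be sketched as a routing table: each source attribute is mapped either to a curation workflow or directly to a target ontology term. This is a minimal sketch, not the AIDAVA tool itself; the names `MappingRule` and `route`, the workflow name, and the `ex:` ontology term are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MappingRule:
    """Hypothetical onboarding rule for one source attribute."""
    source_attribute: str
    workflow: Optional[str] = None         # assumed name of a curation workflow
    ontology_target: Optional[str] = None  # assumed target ontology term

    def __post_init__(self):
        # A rule must point to a workflow or to the ontology, never both or neither.
        if (self.workflow is None) == (self.ontology_target is None):
            raise ValueError("set exactly one of workflow or ontology_target")


def route(record, rules):
    """Split a source record into direct mappings, workflow jobs, and unmapped attributes."""
    by_attribute = {rule.source_attribute: rule for rule in rules}
    direct, queued, unmapped = {}, [], []
    for attribute, value in record.items():
        rule = by_attribute.get(attribute)
        if rule is None:
            unmapped.append(attribute)  # left for a human to review
        elif rule.ontology_target is not None:
            direct[rule.ontology_target] = value
        else:
            queued.append((rule.workflow, attribute, value))
    return direct, queued, unmapped


# Example rules and record; attribute names and values are invented.
rules = [
    MappingRule("dob", ontology_target="ex:birthDate"),
    MappingRule("diag_text", workflow="free_text_extraction"),
]
direct, queued, unmapped = route(
    {"dob": "1970-01-01", "diag_text": "STEMI", "foo": 1}, rules
)
```

Keeping unmapped attributes in a separate list, rather than dropping them, reflects the human-in-the-loop idea in the text: anything the rules cannot place is surfaced for review.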
In terms of natural language processing, the team annotated documents in Estonian, Dutch, and German to train NLP and Entity Linking models.
The project also integrated a structured governance process to work with an existing ontology (SPHN), planning to expand to a family of ontologies in the second prototype generation to ensure long-term sustainability.
This phase of AIDAVA has showcased its ability to tackle complex data curation and interoperability challenges, laying a strong foundation for a virtual assistant that could transform personal health data management.
AIDAVA's innovative approach to data curation and interoperability offers several results that surpass current state-of-the-art solutions in different areas.
AIDAVA leverages AI to automate health data curation, enhancing data quality and efficiency while reducing human error. In addition, by introducing an interim step in the health data life cycle - the Personal Health Knowledge Graph - AIDAVA facilitates the reuse of data across multiple datasets without repeated curation. This shift from the costly "curate many times, use once" model to a more efficient "curate once, use many times" approach significantly improves the management of heterogeneous health data.
Data quality is addressed in many papers, yet few instruments exist to measure and improve health data quality, owing to the heterogeneity of these data at the source. By centering curation on an integrated health record in a standardised format (the Personal Health Knowledge Graph), AIDAVA opens a new approach: data quality can be verified at the level of the patient, and data quality labels can be derived for published datasets. This approach was defined in the AIDAVA Data Quality Framework and is being successfully implemented with the SHACL validator.
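To make the idea of patient-level checks and dataset-level labels concrete, here is a minimal sketch in plain Python. It is a stand-in for illustration and does not use SHACL or the project's actual framework; the required attributes, the "diagnosis before birth" rule, and the label values `pass` and `review` are assumptions.

```python
from datetime import date

REQUIRED = ("birth_date", "sex")  # hypothetical required attributes


def completeness(record):
    """Fraction of required attributes that are present and not None."""
    present = sum(1 for key in REQUIRED if record.get(key) is not None)
    return present / len(REQUIRED)


def consistency_issues(record):
    """Return human-readable issues; here only one assumed rule is checked."""
    issues = []
    birth = record.get("birth_date")
    for event in record.get("diagnoses", []):
        if birth is not None and event["date"] < birth:
            issues.append(f"diagnosis {event['code']} dated before birth")
    return issues


def quality_label(record, threshold=1.0):
    """Patient-level label: 'pass' only if complete and free of issues."""
    if completeness(record) >= threshold and not consistency_issues(record):
        return "pass"
    return "review"


def dataset_label(records):
    """Dataset-level summary derived from the patient-level labels."""
    labels = [quality_label(record) for record in records]
    return {
        "patients": len(labels),
        "pass": labels.count("pass"),
        "review": labels.count("review"),
    }


# Invented example records.
good = {"birth_date": date(1970, 1, 1), "sex": "F",
        "diagnoses": [{"code": "X1", "date": date(2020, 5, 1)}]}
bad = {"birth_date": date(1970, 1, 1), "sex": None,
       "diagnoses": [{"code": "X2", "date": date(1960, 1, 1)}]}
summary = dataset_label([good, bad])
```

Because every record shares one standardised shape, the same checks run per patient and then roll up into a dataset label, which is the property the text attributes to curating around a single integrated record.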
Most NLP models have been developed for English. AIDAVA is developing models for 3 European languages, one of which (Estonian) is spoken by fewer than 1.5 million people, and will be looking at multilingual models. To reduce the annotation effort required, the project developed annotation guidelines that have already been discussed outside of the project and will be published to the wider NLP community; the resulting annotated datasets will be available to other researchers, conditional on Ethical Committee approval. The development of NLP models for information extraction in multiple languages, along with the combination of BERT-based transformers and GPT-based LLMs, showcases the project's advanced technological capabilities.
AIDAVA's approach to building and adapting several ontologies supports long-term sustainability and interoperability. The project's focus on aligning with existing standards and addressing limitations in current ontology schemas contributes to the broader health data ecosystem.