Periodic Reporting for period 1 - AIDAVA (AI powered Data Curation & Publishing Virtual Assistant)
Reporting period: 2022-09-01 to 2024-02-29
Collaborating with medical experts, AIDAVA established two use cases: the Federated Breast Cancer Registry and Risk Monitoring for Myocardial Infarction Patients. These cases involved mapping out scenarios, selecting data sources, defining data flow compliance with privacy standards and identifying evaluation metrics.
The team finalised the requirements, design, and architecture of the prototype. Development, including AI-based curation tools, is underway. In parallel, a study protocol was developed to test the prototype across three hospital sites; approval from one ethical committee has already been secured. Eight patients from two associations provided feedback; they were involved in designing the prototype and defining the protocol, and expressed a keen interest in the success of the project.
To achieve the core objective of AIDAVA, automation in data curation and quality enhancement, the team identified 11 interoperability issues. For each issue, the team developed a workflow enabling automatic resolution by incorporating AI-based curation tools and human-in-the-loop dialogs. A data source onboarding tool was developed to map attributes from data sources either to one of the workflows or directly to the target ontology. A Data Quality Framework was established and implemented on the curated data through Data Quality checks on consistency and completeness.
In terms of natural language processing, the team annotated documents in Estonian, Dutch, and German to train NLP and Entity Linking models.
The project also integrated a structured governance process to work with an existing ontology (SPHN), planning to expand to a family of ontologies in the second prototype generation to ensure long-term sustainability.
This phase of AIDAVA has showcased its ability to tackle complex data curation and interoperability challenges, laying a strong foundation for a virtual assistant that could transform personal health data management.
AIDAVA leverages AI to automate health data curation, enhancing data quality and efficiency while reducing human error. In addition, by introducing an interim step in the health data life cycle - the Personal Health Knowledge Graph - AIDAVA facilitates the reuse of data across multiple datasets without repeated curation. This shift from the costly "curate many times, use once" model to a more efficient "curate once, use many times" approach significantly improves the management of heterogeneous health data.
Data Quality is a topic addressed in many papers, yet few instruments are in place to measure and improve health data quality, owing to the heterogeneity of these data at the source. By centering curation on an integrated health record in a standardised format (the Personal Health Knowledge Graph), AIDAVA opens a new approach: it enables verification of data quality at the level of the patient and derivation of data quality labels on published datasets. This approach was defined in the AIDAVA Data Quality Framework and is being successfully implemented with the SHACL validator.
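As an illustration only, the patient-level verification described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the AIDAVA implementation: the project uses SHACL validation on a knowledge graph, whereas this example uses a plain dictionary, and the field names, the single consistency rule, and the label values are all hypothetical.

```python
from datetime import date

# Hypothetical required attributes; the actual AIDAVA Data Quality Framework
# defines its own rules, which are not reproduced here.
REQUIRED_FIELDS = ("patient_id", "birth_date", "diagnosis_code", "diagnosis_date")


def completeness(record: dict) -> float:
    """Return the fraction of required fields that are present and non-empty."""
    present = sum(1 for field in REQUIRED_FIELDS if record.get(field) not in (None, ""))
    return present / len(REQUIRED_FIELDS)


def consistency_issues(record: dict) -> list[str]:
    """Return descriptions of violated consistency rules (one illustrative rule)."""
    issues = []
    birth = record.get("birth_date")
    diagnosis = record.get("diagnosis_date")
    if birth and diagnosis and diagnosis < birth:
        issues.append("diagnosis_date precedes birth_date")
    return issues


def quality_label(record: dict) -> str:
    """Derive a coarse, hypothetical quality label for one patient record."""
    if consistency_issues(record):
        return "inconsistent"
    return "complete" if completeness(record) == 1.0 else "incomplete"


record = {
    "patient_id": "p1",
    "birth_date": date(1970, 1, 1),
    "diagnosis_code": "C50",
    "diagnosis_date": date(2020, 5, 1),
}
print(quality_label(record))
```

Labels computed per patient in this way could then be aggregated to describe a published dataset, which is the idea behind deriving data quality labels from the integrated record rather than from each heterogeneous source.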
Most NLP models have been developed for English. AIDAVA is developing models for three European languages - one of which (Estonian) is spoken by fewer than 1.5 million people - and will be looking at multilingual models. To optimise the annotation effort needed, the project developed annotation guidelines that have already been discussed outside of the project and will be published to the wider NLP community; the resulting annotated datasets will be available - conditional on Ethical Committee approval - to other researchers. The development of NLP models for information extraction in multiple languages, along with the combination of BERT-based transformers and GPT-based LLMs, showcases the project's advanced technological capabilities.
AIDAVA's approach to building and adapting several ontologies supports long-term sustainability and interoperability. The project's focus on aligning with existing standards and addressing limitations in current ontology schemas contributes to the broader health data ecosystem.