Periodic Reporting for period 3 - INODE (INODE - Intelligent Open Data Exploration)
Reporting period: 2022-05-01 to 2023-04-30
Linking new data and making them easily queryable is the first critical requirement in a data exploration scenario. INODE provides a knowledge graph that constitutes a high-level conceptual view of the data. By querying the graph, users can access the information stored in the data sources through a more convenient vocabulary, need not be aware of storage details, and can obtain richer answers thanks to the domain knowledge. This is especially important for our biology use case, since biological data are very complex and distributed over different sources. INODE supports rich queries (e.g., with aggregate functions) in full compliance with the SPARQL 1.1 standard. Furthermore, to leverage unstructured data, INODE worked on extracting triples from unstructured text and on linking the entities in the extracted triples to ontology concepts, with the aim of enriching the content of the OncoMX (bio) database.
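To make the idea of an aggregate query over a knowledge graph concrete, the following is a minimal sketch in plain Python. It is not INODE's implementation: the triples, the `ex:` prefix, the predicate names, and the helper functions `match` and `count_by_object` are all hypothetical and chosen only for illustration. The sketch mimics what a SPARQL 1.1 `COUNT ... GROUP BY` query would compute over a handful of toy triples.

```python
from collections import Counter

# Hypothetical toy triples (subject, predicate, object); not real OncoMX data.
TRIPLES = [
    ("gene:A", "ex:biomarkerFor", "disease:X"),
    ("gene:B", "ex:biomarkerFor", "disease:X"),
    ("gene:C", "ex:biomarkerFor", "disease:Y"),
    ("gene:A", "ex:locatedOn", "chromosome:1"),
]

# The SPARQL 1.1 query that this sketch mimics (ex: is a made-up prefix).
SPARQL_EQUIVALENT = """
PREFIX ex: <http://example.org/>
SELECT ?disease (COUNT(?gene) AS ?n)
WHERE { ?gene ex:biomarkerFor ?disease }
GROUP BY ?disease
"""

def match(triples, s=None, p=None, o=None):
    """Yield triples matching the pattern; None acts as a variable."""
    for ts, tp, to in triples:
        if s in (None, ts) and p in (None, tp) and o in (None, to):
            yield ts, tp, to

def count_by_object(triples, predicate):
    """Group matches by object and count them (COUNT ... GROUP BY)."""
    return Counter(o for _, _, o in match(triples, p=predicate))

counts = count_by_object(TRIPLES, "ex:biomarkerFor")
print(dict(counts))
```

In this toy setting the user asks about genes and diseases using the vocabulary of the graph, without knowing how or where the underlying records are stored, which is the convenience that the conceptual view is meant to provide.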
INODE enables data access through two powerful paradigms: search by natural language and exploration using operators. For the former, the INODE platform is powered by several tools that rely on different, complementary technologies: rule-based and deep learning-based. For the latter, powerful operators allow the user to manipulate the results. For example, a By-neighbors operator searches the neighborhood of a set of items and returns close sets, an operation that helps our astrophysicists, for instance, to find galaxies. To help the user understand the results, INODE provides natural language explanations at each step. Furthermore, recommendations offer different options for continuing the data exploration. Visual exploration improves data understanding by increasing information density and providing a better overview of multiple search results.
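The behaviour of a neighborhood-style operator can be sketched as follows. This is an illustration under stated assumptions, not the definition of INODE's By-neighbors operator: here each set of items is assumed to be a list of numeric feature vectors, closeness between sets is assumed to be the Euclidean distance between their centroids, and the function name `by_neighbors`, the parameter `k`, and the galaxy data are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(axis) / n for axis in zip(*points)]

def by_neighbors(query_name, item_sets, k=2):
    """Return names of the k sets whose centroids are closest to the query set."""
    target = centroid(item_sets[query_name])
    scored = [
        (math.dist(target, centroid(points)), name)
        for name, points in item_sets.items()
        if name != query_name  # the query set is not its own neighbor
    ]
    return [name for _, name in sorted(scored)[:k]]

# Hypothetical galaxy groups described by (redshift, magnitude) pairs.
GALAXY_SETS = {
    "spirals_low_z": [(0.10, 17.0), (0.12, 17.4)],
    "spirals_mid_z": [(0.20, 17.5), (0.22, 17.9)],
    "ellipticals_low_z": [(0.11, 15.0), (0.13, 15.2)],
    "quasars_high_z": [(2.00, 20.0), (2.30, 20.5)],
}

neighbors = by_neighbors("spirals_low_z", GALAXY_SETS, k=2)
print(neighbors)
```

Centroid distance is only one possible notion of closeness. Because the features here are unscaled, magnitude dominates the distance, so a real operator would likely normalize features or use a domain-specific similarity measure.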
In the second and third periods of the project, the INODE platform was extended with additional functionality and the accuracy of its services was improved. For instance, the machine learning models behind the natural language operators were improved so that they can answer more complex user queries against databases and provide better query recommendations. Moreover, more advanced data exploration operators were introduced to guide users during data analytics. In addition, novel visualization methods were developed to help users navigate large and complex data sets. Finally, INODE now enables federated queries over distributed data sources.
By combining these capabilities, INODE provides for the first time a complete and powerful end-to-end data exploration solution. This solution equips a diverse set of stakeholders (in astrophysics, biology, and policy making) with intuitive tools that serve their data access requirements, including ontology-based data linking and access as well as effective ways to access data, especially when non-technical users are involved. To do so, INODE has achieved several advances both in the individual tools and at the intersection of their underlying technologies.
It is important to note that the concepts of INODE can be applied not only to empower EOSC-hub but also to any other data portal. Hence, there is a great opportunity for wider exploration not only of scientific data but also of the vast amounts of open data currently provided by data portals such as the EU data portal.