Periodic Reporting for period 1 - EDAO (Example-Driven Analytics of Open Knowledge Graphs)
Reporting period: 2019-09-15 to 2021-09-14
These are usually modelled as Open Knowledge Graphs, which represent information as entities linked by semantic relationships.
Yet, to access this type of data and perform such analysis, the typical gateways are specialized query languages (e.g. SPARQL) that are usually challenging to use for non-expert users.
This constitutes a major impediment to the successful exploitation of Linked Open Data.
To support advanced LOD analytics we propose a novel data exploration system that allows users to extract insights within complex and unfamiliar datasets.
We studied and proposed methods to help non-expert users to perform exploratory analysis of open knowledge graphs by applying the Exemplar Query paradigm to the case of Exploratory Online Analytical Processing (OLAP).
Example-based methods have proven to be extremely valuable since they avoid complex query languages by using examples to represent the required information.
Yet, they have never been studied in the OLAP/BI context. Therefore, we propose to study a new Example-Driven Exploration system to bridge the gap between example-based queries and BI methods.
We study two important aspects of the problem: how to enable users to express their information need, and how to make this approach efficient for rich and large scale datasets.
We have performed an initial survey of the literature on graph data management, graph analytics, dataset search, and data exploration.
We have also conducted an in-depth study of data management architectures for storing Knowledge Graphs (as RDF graphs).
Moreover, through unstructured interviews, we interacted both with practitioners in data management and with domain experts in environmental engineering (who participated on a voluntary basis) about their current dataset search activities, their data exploration workflow, and their interaction with new, unfamiliar datasets and with open data.
In connection with this initial analysis, we collaborated with the domain experts in environmental engineering to construct a knowledge graph and a dataset for which our approach can be applied.
This allowed us to gain important insights for the development of a data exploration system that enables the analysis of statistical data stored in RDF.
Then, we worked on proposing effective methods to: (i) enable non-expert users to synthesize exploratory queries from examples, (ii) improve dataset search capabilities by exploiting knowledge graphs, and (iii) improve the performance of triplestores through query optimization and view materialization techniques.
As a result, within this project we propose new algorithms, and a system implementing them, that help non-expert users obtain complex analytical insights by reverse engineering SPARQL queries from examples of interest.
Moreover, to help the developers of analytical systems, we identify important storage and query answering techniques that can be applied to ensure efficient query executions for complex graph queries.
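The idea of reverse engineering a query from examples can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes a toy in-memory graph with hypothetical `ex:` identifiers, and it simply keeps the (predicate, object) pairs shared by all example entities and renders them as a SPARQL pattern (prefix declarations omitted).

```python
from collections import defaultdict

# Toy knowledge graph as (subject, predicate, object) triples.
# All identifiers are hypothetical and used only for illustration.
TRIPLES = [
    ("ex:obs1", "ex:country", "ex:Denmark"),
    ("ex:obs1", "ex:year", "2019"),
    ("ex:obs1", "ex:measure", "ex:CO2"),
    ("ex:obs2", "ex:country", "ex:Denmark"),
    ("ex:obs2", "ex:year", "2020"),
    ("ex:obs2", "ex:measure", "ex:CO2"),
    ("ex:obs3", "ex:country", "ex:Italy"),
    ("ex:obs3", "ex:year", "2020"),
    ("ex:obs3", "ex:measure", "ex:CO2"),
]


def shared_constraints(triples, examples):
    """Return the sorted (predicate, object) pairs that hold for every example."""
    if not examples:
        raise ValueError("at least one example entity is required")
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add((p, o))
    return sorted(set.intersection(*(props[e] for e in examples)))


def synthesize_query(triples, examples):
    """Render the shared constraints as a SPARQL SELECT query string."""
    patterns = " ".join(f"?x {p} {o} ." for p, o in shared_constraints(triples, examples))
    return f"SELECT ?x WHERE {{ {patterns} }}"


def evaluate(triples, constraints):
    """Return the subjects of the toy graph that satisfy all constraints."""
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add((p, o))
    return sorted(s for s, po in props.items() if set(constraints) <= po)


query = synthesize_query(TRIPLES, ["ex:obs1", "ex:obs2"])
matches = evaluate(TRIPLES, shared_constraints(TRIPLES, ["ex:obs1", "ex:obs2"]))
```

Here the two examples share a country and a measure but not a year, so the synthesized query keeps only the first two as constraints. A real system would also have to rank competing candidate queries and handle aggregation, which this sketch leaves out.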
We have published an official project website (https://edao.eu/), created a profile for the project on the code-sharing platform GitHub (https://github.com/EDAO-Project), and opened a social media account on Twitter (https://twitter.com/EDAO_eu). These channels have been used, and will continue to be used, for the dissemination of the project results.
We also presented the initial results of the project and participated in scientific and networking events and conferences, namely: BrainsXBusiness event "Kick-off: AI for the People", ESWC, EDBT, VLDB.
We also organized two workshops on "Search, Exploration, and Analysis in Heterogeneous Datastores", co-located with EDBT 2020 and VLDB 2021, to reach researchers and practitioners interested in the broad topics of this project.
(1) designed methods to allow users to perform exploratory analytics by synthesizing queries from examples and simple interactions;
(2) studied the application of active learning techniques to expedite the query synthesis process;
(3) proposed a novel method for dataset-search based on the exploration of links between datasets and a knowledge graph;
(4) proposed a query optimization method for queries based on the definition of "validating shapes";
(5) implemented a system to test different view materialization techniques to optimize analytical queries over knowledge graphs;
(6) proposed how to extend the citation-graph model to allow for data citation;
(7) identified the current limitations of existing systems in enabling effective knowledge graph exploration and proposed a vision for further research within this subject.
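As an illustration of the view materialization idea in item (5), the following is a minimal sketch under simplifying assumptions: observations are plain tuples rather than RDF triples, and all names and values are hypothetical. An aggregate is precomputed once at a fine granularity, and a coarser analytical query is then answered from that view instead of rescanning the base data.

```python
from collections import defaultdict

# Hypothetical statistical observations: (country, year, value).
OBSERVATIONS = [
    ("ex:Denmark", 2019, 10.0),
    ("ex:Denmark", 2019, 5.0),
    ("ex:Denmark", 2020, 7.0),
    ("ex:Italy", 2020, 20.0),
]


def materialize_view(observations):
    """Precompute SUM(value) grouped by (country, year)."""
    view = defaultdict(float)
    for country, year, value in observations:
        view[(country, year)] += value
    return dict(view)


def rollup_by_country(view):
    """Answer a coarser query (SUM by country) from the materialized view."""
    totals = defaultdict(float)
    for (country, _year), subtotal in view.items():
        totals[country] += subtotal
    return dict(totals)


view = materialize_view(OBSERVATIONS)
totals = rollup_by_country(view)
```

The roll-up touches one entry per (country, year) group rather than one per observation, which is where the saving comes from when the base data is large. Which views to materialize is the hard design question and is not addressed here.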
In summary, our work introduced new models for example-driven reverse engineering of analytical queries on knowledge graphs; such methods did not exist before.
We also provide the first in-depth study of storage architectures for triplestores, which will allow a principled approach to overcoming existing limitations in the efficient and effective design of knowledge graph management systems.
Finally, we applied our data management techniques to novel applications in two important domains, industrial ecology and digital libraries, introducing a novel model for better use of knowledge graphs in supporting open science.
The results of our work have, in the short term, the potential to guide the advancement in the functionality of both open-source and commercial systems.
In the long term, once such a system is made available to the public, it should enable more people, especially non-expert users, to directly access data published in open knowledge graphs, thereby ensuring a broader "democratization" of access to information.