
Techniques, methods and tools for Issue-Driven European Arena Analytics: supporting citizens to easily explore the trove of publicly available data to build a viewpoint on a specific issue.

Periodic Reporting for period 1 - IDEAA (Techniques, methods and tools for Issue-Driven European Arena Analytics: supporting citizens to easily explore the trove of publicly available data to build a viewpoint on a specific issue.)

Reporting period: 2018-07-01 to 2020-06-30

The final aim of the IDEAA MSCA is to help non-expert users easily explore the trove of publicly available graph-based data and, in particular, semantic (i.e. RDF) data. RDF is a standard model in which data are represented as subject-predicate-object triples. For example, the notion “Bill Gates has occupation entrepreneur” is represented in RDF by a subject denoting “Bill Gates”, a predicate denoting “has occupation”, and an object denoting “entrepreneur”. Similarly, the fact that “Bill Gates co-founded Microsoft” is represented by a subject again denoting “Bill Gates”, a predicate denoting “co-founded”, and an object denoting “Microsoft”. One can further extend the data by adding, e.g., that “Microsoft is of type technology company”. Here “Microsoft” becomes the subject of the triple, the predicate is “of type”, and the object is “technology company”. In this small example, the RDF dataset is composed of three triples.
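The three triples of this example can be written down directly. The following is a minimal sketch in Python, assuming plain strings stand in for the identifiers (IRIs) that real RDF data would use; the helper name objects_of is hypothetical and introduced only for illustration.

```python
# Each triple is (subject, predicate, object).
# Assumption: plain strings replace the IRIs and literals of real RDF data.
triples = [
    ("Bill Gates", "has occupation", "entrepreneur"),
    ("Bill Gates", "co-founded", "Microsoft"),
    ("Microsoft", "of type", "technology company"),
]


def objects_of(graph, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


print(len(triples))
print(objects_of(triples, "Bill Gates", "co-founded"))
```

Because “Microsoft” is the object of one triple and the subject of another, the triples form a graph that can be followed from one resource to the next.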

Many datasets are available in RDF; several are on the European Data Portal. They are also increasingly being published as part of Linked Open Data, whose aim is to relate information from different data sources through semantic links. For example, a source D1 exposes information about People while a source D2 exposes information about Companies. D1 contains data about "Bill Gates" and D2 contains data about "Microsoft". We can link them by adding a property "co-founded" from "Bill Gates" in D1 to "Microsoft" in D2.

This leads to great opportunities to analyze and monitor the many different aspects of our society, but also to tremendous challenges in making sense of the data. A lot of public information is available. However, RDF graphs tend to grow considerably in size, heterogeneity and complexity. They are hard to read, and acquiring a comprehensive picture of their content is very difficult because it is easy to get lost in the huge amount of data. This lack of knowledge amid an abundance of content hinders our full participation in modern society, where basically all interactions are either directly digital or are captured and recorded in digital form.

In IDEAA we worked to overcome this challenge by pursuing three main objectives: (O1) extracting succinct and meaningful knowledge from RDF graphs as a means of representing their content, (O2) efficiently suggesting interesting and unexpected aspects of the data to users, and (O3) comparing knowledge from different sources.
We proposed a modular framework for RDF data exploration, implemented as a software prototype that brings together the research carried out for each objective.

Objective O1 is to extract meaningful knowledge from RDF graphs. This knowledge is the basis of data exploration, as it represents data in terms of their properties and avoids flooding users with too many individual data points. To achieve O1 we first gave a formal definition of the notion of knowledge and then provided ways to extract such knowledge directly from RDF data.

We defined knowledge in terms of “multi-dimensional aggregates” (MDAs). Consider a dataset about CEOs; an example of an MDA would be “Average age of CEOs grouped by nationality and number of managed companies”. The dataset itself contains all the data about each CEO (e.g. name, surname, age, nationality, etc.), whereas the MDA exposes a more general property, namely aggregated data.

Formally, an MDA is composed of: a topic of interest (the CEOs), the dimensions of analysis (nationality and number of managed companies), the measures (age) and an aggregation function (average). We provided a formal semantics of MDAs and an extensive set of options to identify them directly from RDF graphs. We also proposed ways to automatically derive, from the original data, novel dimensions and measures. This feature greatly enriches the pool of candidate MDAs, thus providing several new angles of analysis.
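The four components of an MDA can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not the project's implementation: the toy triples, the function evaluate_mda and the helpers first and count are all hypothetical. The helper count also illustrates a derived dimension, since “number of managed companies” is not stored in the data but obtained by counting “manages” triples.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (resource, property, value) triples.
triples = [
    ("alice", "type", "CEO"), ("alice", "nationality", "US"),
    ("alice", "age", 50), ("alice", "manages", "A1"), ("alice", "manages", "A2"),
    ("bob", "type", "CEO"), ("bob", "nationality", "US"),
    ("bob", "age", 60), ("bob", "manages", "B1"), ("bob", "manages", "B2"),
    ("carla", "type", "CEO"), ("carla", "nationality", "IT"),
    ("carla", "age", 40), ("carla", "manages", "C1"),
    ("dave", "type", "Engineer"), ("dave", "nationality", "US"), ("dave", "age", 30),
]


def first(prop):
    """Dimension or measure read directly from a property (first value)."""
    return lambda values: values[prop][0]


def count(prop):
    """Derived dimension or measure: how many values a property has."""
    return lambda values: len(values[prop])


def evaluate_mda(graph, topic, dimensions, measure, aggregate):
    """Evaluate an MDA: topic, dimensions, measure and aggregation function."""
    # Collect the property values of every resource.
    props = defaultdict(lambda: defaultdict(list))
    for s, p, o in graph:
        props[s][p].append(o)
    # Group the measure of each resource of the topic by its dimension values.
    groups = defaultdict(list)
    for values in props.values():
        if topic in values["type"]:
            key = tuple(dim(values) for dim in dimensions)
            groups[key].append(measure(values))
    return {key: aggregate(vals) for key, vals in groups.items()}


# "Average age of CEOs grouped by nationality and number of managed companies"
result = evaluate_mda(
    triples, "CEO", [first("nationality"), count("manages")], first("age"), mean
)
print(result)
```

Swapping the dimensions, the measure or the aggregation function yields a different candidate MDA over the same data, which is why the pool of candidates grows quickly.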

Objective O2 is to suggest interesting or unexpected aspects of the data. Many meaningful pieces of knowledge can be extracted from RDF graphs. For the CEOs, we can extract the “Average age of CEOs grouped by nationality and number of managed companies”, the “Sum of the net worth of CEOs with political connections, grouped by country of origin”, and so on. The goal of O2 is to identify, among these MDAs, the most interesting or unexpected ones.

To achieve O2 we first defined what interesting or unexpected means when dealing with MDAs, and then provided several strategies to isolate the best ones. Aggregates whose results show a trend or unexpected peaks are considered interesting. We quantify their interestingness using statistical moments and then look for the MDAs with the best score.
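As an illustration of scoring with statistical moments, the sketch below computes the variance (second central moment) and the skewness (standardized third central moment) of the values returned by an aggregate, and ranks candidates by variance. Which moments the project uses and how it combines them into a score is not stated here, so ranking by variance alone is an assumption, and all names and numbers are hypothetical.

```python
from statistics import fmean


def central_moment(values, k):
    """k-th central moment of a list of numbers."""
    m = fmean(values)
    return fmean([(v - m) ** k for v in values])


def variance(values):
    return central_moment(values, 2)


def skewness(values):
    """Standardized third moment; 0.0 for a constant result."""
    var = central_moment(values, 2)
    return 0.0 if var == 0 else central_moment(values, 3) / var ** 1.5


# Hypothetical results of two candidate MDAs (one value per group).
candidates = {
    "flat": [50, 50, 50, 50],    # same value in every group
    "peaked": [50, 50, 50, 90],  # one group stands out
}

# Assumption: a higher variance means a more interesting aggregate.
ranking = sorted(candidates, key=lambda name: variance(candidates[name]), reverse=True)
print(ranking)
```

A flat result has zero variance and carries no surprise, whereas a single outlying group raises both the variance and the skewness.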

We generate many candidate MDAs, and determining their scores can be quite costly depending on the size of the data. Being able to do so in reasonable time is crucial; efficiency therefore played a key role in O2 and in the project as a whole. We proposed novel techniques to efficiently (i) quantify the interestingness of MDAs by computing the results of several MDAs simultaneously and (ii) prune uninteresting MDAs using early stopping, identifying as soon as possible those that we can be sure (based on strong statistical evidence) are not the best ones.
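The idea of early stopping can be sketched with a generic elimination scheme. The code below is not the project's algorithm: it assumes each candidate's score is the mean of per-sample contributions in [0, 1], reads the data batch by batch for all surviving candidates at once, and uses a Hoeffding confidence bound to discard candidates whose best possible score is already below the worst possible score of the current leader. The function name, the parameters delta and batch, and the toy streams are hypothetical, and the bound is simplified (it ignores the correction needed for repeated checks over many candidates).

```python
import math


def prune_candidates(streams, delta=0.05, batch=10):
    """Return the names of the candidates that survive early stopping.

    streams: dict mapping a candidate name to a list of per-sample
    scores in [0, 1]; the final score of a candidate is their mean.
    """
    running = {name: 0.0 for name in streams}  # running sum per candidate
    n_total = min(len(s) for s in streams.values())
    for start in range(0, n_total, batch):
        end = min(start + batch, n_total)
        # One pass over the batch updates every surviving candidate.
        for name in running:
            running[name] += sum(streams[name][start:end])
        # Hoeffding bound on the estimated means after `end` samples.
        eps = math.sqrt(math.log(2 / delta) / (2 * end))
        means = {name: total / end for name, total in running.items()}
        best_lower = max(means.values()) - eps
        # Keep only candidates that could still turn out to be the best.
        running = {
            name: total
            for name, total in running.items()
            if means[name] + eps >= best_lower
        }
    return set(running)


# Hypothetical candidates with constant per-sample scores.
streams = {"A": [0.9] * 100, "B": [0.1] * 100, "C": [0.85] * 100}
survivors = prune_candidates(streams)
print(sorted(survivors))
```

In this toy run the clearly weaker candidate is dropped after the second batch, while the two close candidates are both kept because the evidence does not separate them.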

Objective O3 is to compare knowledge extracted from different sources and grasp the similarities or differences between them. For example, a source about CEOs shows one trend for the “Average age of CEOs grouped by nationality and number of managed companies”, whereas a source about managers of organizations might show a different trend for the same aggregate. Such a difference indicates that CEOs behave differently from managers.

RDF natively supports modeling heterogeneous data. A resource in the data might be of type CEO while another might be of type Manager. Moreover, RDF allows hierarchies between types to be defined, e.g. we can state that both CEO and CTO are subtypes of Manager. Thus, all CEOs are Managers and all CTOs are Managers as well.

To reach O3 we take advantage of such hierarchies. We extract MDAs from resources of different types and then provide a visual comparison of the resulting MDAs. We compare MDAs with the same dimensions, measure and aggregation function. This comparison is performed among resources whose types are part of the same hierarchy. Thus, given the MDA “Average age grouped by nationality and number of managed companies”, we visualize its results for three heterogeneous sets of resources: CEOs, CTOs and Managers.
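The comparison along a type hierarchy can be sketched as follows. This is a simplified illustration with hypothetical data and names: it uses a single dimension (nationality) instead of two, stores the hierarchy as a child-to-parent dictionary, and returns the numbers that a visual comparison would plot rather than drawing them.

```python
from statistics import mean

# Hypothetical hierarchy: both CEO and CTO are subtypes of Manager.
subclass_of = {"CEO": "Manager", "CTO": "Manager"}

# Hypothetical resources with their most specific type.
people = [
    {"type": "CEO", "nationality": "US", "age": 60},
    {"type": "CEO", "nationality": "IT", "age": 50},
    {"type": "CTO", "nationality": "US", "age": 40},
    {"type": "Manager", "nationality": "US", "age": 50},
]


def is_a(resource_type, ancestor):
    """True if resource_type equals ancestor or is one of its subtypes."""
    while resource_type is not None:
        if resource_type == ancestor:
            return True
        resource_type = subclass_of.get(resource_type)
    return False


def average_age_by_nationality(records, target_type):
    """Same dimension, measure and aggregation function for any type."""
    groups = {}
    for record in records:
        if is_a(record["type"], target_type):
            groups.setdefault(record["nationality"], []).append(record["age"])
    return {nationality: mean(ages) for nationality, ages in groups.items()}


comparison = {
    t: average_age_by_nationality(people, t) for t in ("CEO", "CTO", "Manager")
}
print(comparison)
```

Since every CEO and CTO is also a Manager, the Manager result aggregates over all four resources, which is what makes the three results directly comparable.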

The results of our research have been published and presented at two international conferences and one national conference. We took part in four poster sessions and disseminated our results through a website, regular seminars and social media activities.
To the best of our knowledge, through this research we have proposed and realized the first comprehensive, completely RDF-oriented framework for automatic RDF data exploration. The work carried out so far opens up many possibilities for future research questions and improvements. We have identified several interesting future directions, both as a consequence of our research efforts and through the feedback we received while presenting the work at several conferences. Further enhancing interactivity is a promising direction, including for example the possibility to easily filter the data and the results. Providing better support for specific types of data, e.g. temporal or geo-spatial data, is also an interesting line of research.
Plot obtained using the techniques proposed in the project.
Social event at BDA 2019 where our work was given the Best Demo Award.
Presenting our research and showing a demo of the prototype developed in the project.