Periodic Reporting for period 1 - IDEAA (Techniques, methods and tools for Issue-Driven European Arena Analytics: supporting citizens to easily explore the trove of publicly available data to build a viewpoint on a specific issue.)
Reporting period: 2018-07-01 to 2020-06-30
Many datasets are available in RDF, and several are published on the European Data Portal. They are also increasingly being published as Linked Open Data, whose aim is to relate information from different data sources through semantic links. For example, a source D1 exposes information about People while a source D2 exposes information about Companies. D1 contains data about "Bill Gates" and D2 contains data about "Microsoft". We can link them by adding a property "co-founded" from "Bill Gates" in D1 to "Microsoft" in D2.
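The link in this example can be sketched with plain subject-predicate-object triples. The Python snippet below is only an illustration: the `http://example.org/...` URIs, the `coFounded` property name and the `objects` helper are hypothetical placeholders, not identifiers taken from any real dataset or from the project.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# All URIs below are hypothetical placeholders.
D1 = "http://example.org/people/"
D2 = "http://example.org/companies/"
VOCAB = "http://example.org/vocab/"

# Source D1 describes people, source D2 describes companies.
people = {(D1 + "Bill_Gates", VOCAB + "name", "Bill Gates")}
companies = {(D2 + "Microsoft", VOCAB + "name", "Microsoft")}

# The semantic link: its subject lives in D1 and its object lives in D2.
link = (D1 + "Bill_Gates", VOCAB + "coFounded", D2 + "Microsoft")

# Linking the two sources amounts to taking the union of their triples plus the link.
linked_graph = people | companies | {link}


def objects(graph, subject, predicate):
    """Return all objects reachable from `subject` via `predicate`."""
    return {o for s, p, o in graph if s == subject and p == predicate}


co_founded = objects(linked_graph, D1 + "Bill_Gates", VOCAB + "coFounded")
```

Once the link exists, a query that starts from a person in D1 can reach a company described in D2, which is what makes the combined data explorable as a single graph.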
This leads to great opportunities to analyze and monitor the many different aspects of our society, but also to tremendous challenges in making sense of the data. A lot of public information is available. However, RDF graphs tend to grow considerably in size, heterogeneity and complexity. They are hard to read, and acquiring a comprehensive picture of their content is very difficult because it is easy to get lost in the huge amount of data. This lack of knowledge amid an abundance of content hinders our full participation in modern society, where basically all interactions are either directly digital or are captured and recorded in digital form.
In IDEAA we worked to overcome this challenge by pursuing three main objectives: (O1) extracting succinct and meaningful knowledge from RDF graphs as a means of representing their content, (O2) efficiently suggesting interesting and unexpected aspects of the data to users, and (O3) comparing knowledge from different sources.
Objective O1 is to extract meaningful knowledge from RDF graphs. This knowledge is the basis of data exploration, as it represents data in terms of their properties and avoids flooding users with too many individual data points. To achieve O1 we first gave a formal definition of the notion of knowledge and then provided ways to extract such knowledge directly from RDF data.
We defined knowledge in terms of “multi-dimensional aggregates” (MDAs). Consider a dataset about CEOs: an example of an MDA would be “Average age of CEOs grouped by nationality and number of managed companies”. The dataset itself contains all the data about each CEO (e.g. name, surname, age, nationality, etc.), whereas the MDA exposes a more general property, that is, aggregated data.
Formally, an MDA is composed of: a topic of interest (the CEOs), the dimensions of analysis (nationality and number of managed companies), the measures (age) and an aggregation function (average). We provided a formal semantics of MDAs and an extensive set of options to identify them directly from RDF graphs. We also proposed ways to automatically derive, from the original data, novel dimensions and measures. This feature greatly enriches the pool of candidate MDAs, thus providing several new angles of analysis.
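The four components of an MDA map naturally onto a group-by computation. The following Python sketch evaluates the example MDA over a handful of made-up records; the record layout, the `evaluate_mda` function and its parameter names are assumptions chosen for illustration and do not reproduce the project's formal semantics or its implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records standing in for CEO resources extracted from an RDF graph.
# The topic of interest is the set of facts itself (here, the CEOs).
ceos = [
    {"name": "A", "age": 50, "nationality": "FR", "companies": 1},
    {"name": "B", "age": 60, "nationality": "FR", "companies": 1},
    {"name": "C", "age": 40, "nationality": "IT", "companies": 2},
    {"name": "D", "age": 45, "nationality": "IT", "companies": 2},
    {"name": "E", "age": 70, "nationality": "FR", "companies": 3},
]


def evaluate_mda(facts, dimensions, measure, aggregate):
    """Group `facts` by `dimensions` and apply `aggregate` to `measure` in each group."""
    groups = defaultdict(list)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)  # one group per combination of dimension values
        groups[key].append(fact[measure])
    return {key: aggregate(values) for key, values in groups.items()}


# "Average age of CEOs grouped by nationality and number of managed companies"
avg_age = evaluate_mda(ceos, ("nationality", "companies"), "age", mean)
```

Swapping the dimensions, the measure or the aggregation function (for instance `sum` or `max` instead of `mean`) yields a different candidate MDA over the same facts.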
Objective O2 is to suggest interesting or unexpected aspects of the data. Many pieces of meaningful knowledge can be extracted from RDF graphs. For the CEOs, we can extract the “Average age of CEOs grouped by nationality and number of managed companies”, the “Sum of the net worth of CEOs with political connections, grouped by country of origin”, and so on. The goal of O2 is to identify, among these MDAs, the most interesting or unexpected ones.
To achieve O2 we first defined what interesting or unexpected means when dealing with MDAs, then we provided several strategies to isolate the best ones. Aggregates whose results show a trend or unexpected peaks are considered interesting. We quantify their interestingness using statistical moments, and then look for the MDAs with the best scores.
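As a rough illustration of scoring with statistical moments, the Python sketch below computes the variance and skewness of the values returned by each candidate MDA and ranks the candidates by variance. The choice of variance as the score, the function names and the sample values are assumptions made for this example; the exact scoring functions used in the project are not specified here.

```python
def central_moment(values, k):
    """k-th central moment of a list of numbers (population form)."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** k for v in values) / len(values)


def variance(values):
    """Second central moment: how much the aggregated values spread out."""
    return central_moment(values, 2)


def skewness(values):
    """Standardized third moment: large when a few groups peak far from the rest."""
    var = variance(values)
    if var == 0:
        return 0.0  # constant results have no asymmetry
    return central_moment(values, 3) / var ** 1.5


# Hypothetical aggregated results of two candidate MDAs (one value per group).
candidates = {
    "flat": [10, 10, 10, 10],    # every group has the same value: nothing to see
    "peaked": [10, 10, 10, 50],  # one group stands out
}

# Rank candidates from most to least interesting, here using variance as the score.
ranked = sorted(candidates, key=lambda name: variance(candidates[name]), reverse=True)
```

With these values the "peaked" candidate ranks first, since a flat result has zero variance and zero skewness.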
We generate many candidate MDAs, and determining their scores can be quite costly depending on the size of the data. Being able to do so in reasonable time is crucial; efficiency therefore played a key role in O2 and in the project as a whole. We proposed novel techniques to efficiently (i) quantify the interestingness of MDAs by simultaneously computing the results of several of them and (ii) prune uninteresting MDAs using early stopping, which identifies as soon as possible those that we can be sure (based on strong statistical evidence) are not the best ones.
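The idea behind point (i) can be illustrated by sharing a single scan of the data across several average-based MDAs. The Python sketch below is a simplified stand-in under that assumption, not the project's algorithm: `evaluate_many`, the record layout and the MDA names are hypothetical, and the early-stopping pruning of point (ii) is not covered.

```python
from collections import defaultdict

# Hypothetical toy records standing in for CEO resources.
facts = [
    {"nationality": "FR", "age": 50, "net_worth": 2.0},
    {"nationality": "FR", "age": 60, "net_worth": 4.0},
    {"nationality": "IT", "age": 40, "net_worth": 1.0},
]


def evaluate_many(facts, mdas):
    """Evaluate several average-based MDAs in a single pass over `facts`.

    `mdas` maps an MDA name to a (dimensions, measure) pair.
    """
    # For every MDA and every group, keep a running [sum, count] pair.
    state = {name: defaultdict(lambda: [0.0, 0]) for name in mdas}
    for fact in facts:  # the data is scanned only once
        for name, (dimensions, measure) in mdas.items():
            cell = state[name][tuple(fact[d] for d in dimensions)]
            cell[0] += fact[measure]
            cell[1] += 1
    # Turn the running sums and counts into averages.
    return {
        name: {key: total / count for key, (total, count) in groups.items()}
        for name, groups in state.items()
    }


results = evaluate_many(
    facts,
    {
        "avg_age_by_nationality": (("nationality",), "age"),
        "avg_worth_by_nationality": (("nationality",), "net_worth"),
    },
)
```

Evaluating each MDA separately would read the facts once per candidate; here the cost of the scan is paid once regardless of how many candidates are scored.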
Objective O3 is to compare knowledge extracted from different sources and grasp the similarities or differences between them. A source about CEOs shows the “Average age of CEOs grouped by nationality and number of managed companies”, whereas a source about Managers of organizations might show a different trend for the same aggregate. Such a difference suggests that CEOs behave differently from Managers.
RDF natively supports modeling heterogeneous data. A resource in the data might be of type CEO while another might be of type Manager. Moreover, RDF allows hierarchies between types to be defined, e.g. we can state that both CEO and CTO are subtypes of Manager. Thus, all CEOs are Managers and all CTOs are Managers as well.
To reach O3 we take advantage of such hierarchies. We extract MDAs from resources of different types, then provide a visual comparison of the extracted MDAs. We compare MDAs with the same dimensions, measure and aggregation function. This comparison is performed among resources whose types are part of the same hierarchy. Thus, given the MDA “Average age grouped by nationality and number of managed companies”, we visualize its results for the three heterogeneous sets of resources: CEOs, CTOs and Managers.
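A comparison of this kind can be sketched by evaluating the same MDA over each type in a hierarchy. In the Python snippet below, the `SUBTYPE_OF` mapping, the `is_a` helper and the sample resources are hypothetical, and the MDA is simplified to a single dimension (nationality) to keep the example short; the visual side of the comparison is left out.

```python
from statistics import mean

# Hypothetical subtype hierarchy: child type -> parent type.
SUBTYPE_OF = {"CEO": "Manager", "CTO": "Manager"}


def is_a(resource_type, target):
    """True if `resource_type` equals `target` or is a (transitive) subtype of it."""
    while resource_type is not None:
        if resource_type == target:
            return True
        resource_type = SUBTYPE_OF.get(resource_type)  # climb one level up
    return False


# Hypothetical resources, each typed with its most specific type.
resources = [
    {"type": "CEO", "age": 60, "nationality": "FR"},
    {"type": "CEO", "age": 50, "nationality": "FR"},
    {"type": "CTO", "age": 40, "nationality": "FR"},
    {"type": "Manager", "age": 30, "nationality": "FR"},
]


def avg_age_by_nationality(target_type):
    """Same dimension, measure and aggregation function, restricted to one type."""
    groups = {}
    for r in resources:
        if is_a(r["type"], target_type):
            groups.setdefault(r["nationality"], []).append(r["age"])
    return {nationality: mean(ages) for nationality, ages in groups.items()}


# One result per type in the hierarchy, ready to be plotted side by side.
comparison = {t: avg_age_by_nationality(t) for t in ("CEO", "CTO", "Manager")}
```

Because CEOs and CTOs are also Managers, the Manager result aggregates over all four resources, while the CEO and CTO results cover only their own subsets.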
The results of our research have been published and presented at two international conferences and one national conference. We took part in 4 poster sessions and disseminated our results through a website, regular seminars and social media activities.