PRODIMA: PRObabilistic Data and information Integration with provenance MAnagement

Final Report Summary - PRODIMA (PRODIMA: PRObabilistic Data and information Integration with provenance MAnagement)

The PRODIMA project (http://www.cs.ox.ac.uk/projects/PRODIMA/index.html) investigated provenance-based probabilistic information integration in the Semantic Web. To gain insight into scalable probabilistic information integration, the project studied existential rules of the Datalog+/- family as a mapping language for (probabilistic ontological) data exchange on the Web. With Datalog+/-, information residing in both databases and ontologies in the Semantic Web can be integrated, which also enables ontology-based data access and exchange. Important results of PRODIMA are an approach to probabilistic ontological data exchange and a precise picture of the computational complexity of deciding whether a (universal) probabilistic solution exists for different classes of existential rules. Several annotations with probabilistic provenance events (which are used to encode provenance information) were investigated for probabilistic ontological data exchange: a very general one based on elementary-event independence of the probabilistic provenance events, a second one based on tuple independence, and a third one based on Bayesian networks (a well-known, well-investigated, and well-established formalism for representing (causal) dependencies between probabilistic events). The project also explored different features of uncertain provenance and how they can be used for probabilistic information integration during the reasoning process. Mapping debugging strategies for information integration based on existential rules were investigated as well, yielding a new approach to mapping debugging together with a complexity analysis and the identification of scalable fragments. In addition, the project analysed provenance-based reasoning with preferences in the Social Web and identified means for scalable query answering in this context.
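
As an illustration of the elementary-event-independence idea mentioned above, the following is a minimal sketch and not the project's implementation. It assumes that a derived fact is annotated with a Boolean combination of elementary provenance events, that these events are mutually independent, and that the probability of the annotation is obtained by summing over all possible worlds. The function name `event_probability`, the event names `e1` to `e3`, and the probabilities are hypothetical and chosen only for the example.

```python
from itertools import product


def event_probability(annotation, elementary):
    """Probability that `annotation` holds, assuming independent elementary events.

    `elementary` maps (hypothetical) event names to probabilities;
    `annotation` is a Boolean function of a world (dict: event name -> bool).
    """
    names = sorted(elementary)
    total = 0.0
    # Enumerate every truth assignment (possible world) of the elementary events.
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # Independence: the weight of a world is the product of per-event factors.
        weight = 1.0
        for name in names:
            p = elementary[name]
            weight *= p if world[name] else 1.0 - p
        if annotation(world):
            total += weight
    return total


# Hypothetical example: a fact supported either by mapping event e1 alone,
# or jointly by events e2 and e3.
elementary = {"e1": 0.5, "e2": 0.5, "e3": 0.5}
annotation = lambda w: w["e1"] or (w["e2"] and w["e3"])
print(event_probability(annotation, elementary))  # 0.625
```

The enumeration is exponential in the number of elementary events, so it only serves to make the semantics concrete; it says nothing about the scalable fragments identified in the project.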

With probabilistic information integration, PRODIMA tackled a crucial problem at the core of our knowledge and information society: dealing in a meaningful way with the huge amounts of distributed and independently created data and information available on the Web and, nowadays, in many other environments. Information integration is one of the main current challenges of information technology. It is closely related to the challenge of managing big data, which are usually stored in distributed environments. Solutions are urgently needed, and they have to deal with uncertainty, which results, for example, from the automatic creation of (huge numbers of) mappings. The results and insights gained in PRODIMA are therefore important for our knowledge and information society: they contribute to making intelligent access to (integrated) information easier, faster, and better. The results of the project also pave the way for the development of profitable business solutions at an international level. Moreover, the results and insights on provenance in general, and on provenance in probabilistic information integration in particular, have a considerable impact on other areas as well. Like information integration itself, provenance is a fundamental concept that is important in many areas, such as digital preservation and scientific data management, to name a few. The application-independent results of PRODIMA affect all these areas. Consider, for example, archives, where provenance is the most important principle for organizing archival records. The insights of PRODIMA have the potential to lead to a significant improvement of digital preservation strategies and, hence, to better digital archives. In this way, both the preservation of and the access to digital cultural heritage will be enhanced.