Provenance for Data-Intensive Systems

Periodic Reporting for period 3 - ProDIS (Provenance for Data-Intensive Systems)

Reporting period: 2021-12-01 to 2023-05-31

Complex data analytics influences almost every aspect of our lives. It is employed in scientific experiments, in medical decision making, in marketing, and in many other contexts. A side effect of this progress is that decisions and actions, some with crucial effects, are often made in part by algorithms whose internals are at best known only to experts; sometimes an algorithm's operation is so complex that even programmers and domain experts struggle to understand the logic it followed. This raises significant concerns of several kinds: How sensitive is a result to changes in the data? How trustworthy is it? What is the rationale for its presence in the result set?

The goal of the ProDIS project is to build models, algorithms and software frameworks that address the above challenges. In terms of importance to society, it has the potential to improve transparency and credibility of decision making, and to reveal and correct errors in the data and/or the analytical process.

The solutions that we build are based on the notion of data provenance - a record of the transformations that data undergoes. While the importance of data provenance is well recognized and it has been studied in different settings, practical data-intensive systems typically do not employ provenance solutions. The reason is that, despite the large body of work and the important advances in data provenance research, there are multiple significant gaps between the state of the art in that research and the provenance support that state-of-the-art complex systems require.
The high-level goal of the project is to develop robust solutions for tracking and using provenance that can be seamlessly integrated with data-intensive systems. Consequently, system developers, users and anyone affected by the operation of a system will be able to understand what data was processed, in what way, and why each result was computed as it was. Provenance support will provide important details on the use of data, help avoid mistakes, and ultimately lead to systems of better quality.
We have been working on provenance solutions along multiple axes.
The first axis is building provenance models that are far more general than the state of the art, in that they apply to a wider range of systems and data analytics formalisms. We have already built solutions for systems that interact with users in natural language, for systems that allow users to explore large databases, and for complex formalisms for updating data.
The second axis is designing efficient algorithms and methods that optimize the tracking and storage of provenance for big data and complex analytics. Here we have developed solutions that can handle data several orders of magnitude larger than previous solutions could.
The third axis is solutions that use provenance in applications: we have developed tools for provenance-based explanations in natural language, as well as for provenance-based hypothetical reasoning, which allows data-intensive systems to be analyzed under multiple hypothetical scenarios of modifications to the underlying data.
We anticipate progress along all of the axes detailed above: solutions that apply to more complex systems and that can handle bigger datasets faster.