Periodic Reporting for period 5 - ProDIS (Provenance for Data-Intensive Systems)
Reporting period: 2024-12-01 to 2025-11-30
Answering such questions is of paramount importance for the trustworthiness of systems, for accountability with respect to decision making that is based on the operation of such systems, and for finding and correcting errors in the systems. Yet these questions are highly non-trivial to answer and require the development of novel models, algorithms, and dedicated software.
These questions and related ones were studied as part of the ProDIS research project, centered around the notion of provenance. In a nutshell, provenance is a record of the computation that took place. The main problems that are addressed in this research are:
(1) How to correctly define and model provenance for data-intensive systems, so that it keeps track, in a concise yet useful way, of the essence of the computation that took place? Naturally, one may keep a detailed record of every operation and every data access, yet this would be redundant and prohibitively costly, in terms of both space and time. Designing provenance models that capture the essence of the computation is essential.
(2) How to efficiently track provenance for complex systems and large-scale data? Given a choice of provenance model, the algorithmic challenge is that of efficient provenance tracking. Provenance tracking inevitably comes with a computational cost beyond that of running the system without provenance; reducing this cost as much as possible is of great importance for the feasibility of the approach. Performance in terms of space and time naturally becomes even more crucial when dealing with large-scale data, as is typically the case with such systems.
(3) How to leverage provenance for explainability? While provenance provides a record of the computation that took place and is key for explainability, it is often insufficient by itself: the raw provenance information is too large and complex to be presented, especially to non-experts. How can we derive explanations, and in particular answers to questions such as those highlighted above, from the provenance information?
(4) How to implement solutions and test their performance? What are useful interfaces through which explanations can be presented, and what are appropriate baselines against which the scalability and effectiveness of solutions can be tested?
Our results have significantly advanced the state of the art on all of the above fronts: we have developed new provenance models, new algorithms that efficiently compute provenance, new tools that use provenance for explainability, and new implementations and experimental benchmarks on which we have tested our solutions.
Provenance Models for Highly Expressive Frameworks: in contrast to previously proposed provenance models, which focus on analyzing static data (namely, they assume that the data is queried but does not change), we have developed (in a paper published in SIGMOD 2020) a novel provenance model for update queries. The main feature of our model is that it captures, in a way made precise in the paper, the “essence of computation” performed by such queries; namely, we show that every two equivalent update queries yield equivalent provenance expressions. We show that the model is useful for diverse applications, including hypothetical reasoning, fine-grained access control, and data trustworthiness certification.
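As background, the classic provenance-polynomial model for static data (which the update-query model above generalizes) annotates each input tuple with a variable; joint use of tuples multiplies annotations, and alternative derivations of the same output add them. The following minimal sketch illustrates this; the relations, tuples, and annotation variables are hypothetical, and for simplicity polynomials are represented idempotently as sets of monomials:

```python
from itertools import product

# Relations whose tuples carry provenance annotations (hypothetical data).
R = [(("a", 1), "r1"), (("a", 2), "r2")]   # R(col1, col2), annotated r1, r2
S = [((1, "x"), "s1"), ((2, "x"), "s2")]   # S(col1, col2), annotated s1, s2

def join_project(R, S):
    """Join R and S on R.col2 = S.col1, projecting onto (R.col1, S.col2).
    Joint use of tuples multiplies annotations (monomial = frozenset of
    variables); alternative derivations of the same output tuple add up
    (polynomial = set of monomials)."""
    out = {}
    for (rt, rv), (st, sv) in product(R, S):
        if rt[1] == st[0]:
            key = (rt[0], st[1])
            out.setdefault(key, set()).add(frozenset({rv, sv}))
    return out

prov = join_project(R, S)
# Single output tuple ('a', 'x'), derived two ways: provenance r1·s1 + r2·s2.
print(prov)
```

The resulting polynomial records *why* the output exists and via which combinations of inputs, which is what downstream explanation tasks consume.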
Efficient Provenance Tracking:
We have further developed multiple solutions that efficiently track and maintain provenance in particular forms for fragments of SQL. These forms are favorable for the computation of provenance-based explanations. Specifically, we have shown how to compute provenance for Boolean yes/no queries (paper in SIGMOD 2022) and for queries performing particular types of arithmetic known as aggregate functions (paper in VLDB 2025). The solutions that we have developed involve algorithms that scale well to large databases even in the case of complex queries.
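To illustrate why such provenance forms are useful, consider a Boolean query whose provenance is kept as a propositional formula over tuple variables: the query's answer on any sub-database can then be read off the formula without re-running the query. A minimal sketch, with a hypothetical formula and tuple names:

```python
# Provenance of a Boolean (yes/no) query as a propositional formula in
# DNF over tuple variables: here (t1 AND t2) OR t3, represented as a
# set of clauses, each clause a frozenset of required tuples.
dnf = {frozenset({"t1", "t2"}), frozenset({"t3"})}

def holds(dnf, present):
    """True iff the query holds when exactly the tuples in `present`
    remain in the database: some clause must be fully present."""
    return any(clause <= present for clause in dnf)

print(holds(dnf, {"t1", "t2"}))  # True: clause {t1, t2} satisfied
print(holds(dnf, {"t1"}))        # False: no clause fully present
print(holds(dnf, {"t3", "t1"}))  # True: clause {t3} satisfied
```

Evaluating the formula takes time proportional to its size, independent of the database or the query plan, which is the property that makes such forms attractive for explanation workloads.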
Tools that Use Provenance:
Provenance is key to explainability, and we have developed provenance-based explanation tools covering various facets of explanation and various settings. These tools significantly push the state of the art in explainability for data-intensive systems, as follows:
(1) Hypothetical reasoning, namely the computation of query results under hypothetical updates. We have developed a novel approach for provenance abstraction that allows for efficient hypothetical reasoning. The idea is that we abstract the provenance expression in a way that slightly reduces the number of hypothetical scenarios we can support, in exchange for large gains in performance. The results were summarized in a paper published in SIGMOD 2020.
(2) Provenance-based attribution: attribution assigns a contribution score to each input with respect to each output. Our work, published in SIGMOD ‘22, SIGMOD ‘24 and VLDB ‘25, has significantly advanced the state of the art in provenance-based attribution. Specifically, in our SIGMOD ‘22 paper we presented the fastest algorithm to date for computing Banzhaf and Shapley values (notable forms of attribution, originating in Game Theory) for Select-Project-Join-Union queries. In SIGMOD ‘24 we further improved this state of the art with even faster algorithms, and in VLDB ‘25 we developed similar, and more efficient, algorithms that handle aggregate queries as well.
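For concreteness, the Shapley value underlying the attribution work above is a tuple's average marginal contribution to the query result, taken over all orders in which tuples are added to the database. The brute-force sketch below computes it exactly but runs in exponential time; avoiding this blow-up is precisely what the algorithms in the papers achieve. The query and tuple names are hypothetical:

```python
from itertools import permutations
from math import factorial

def shapley(tuples, query):
    """Exact Shapley value of each tuple for a Boolean query, by brute
    force over all orders of insertion. `query` maps a set of present
    tuples to True/False. Exponential -- for illustration only."""
    vals = {t: 0.0 for t in tuples}
    for perm in permutations(tuples):
        seen = set()
        for t in perm:
            before = query(seen)        # did the query hold without t?
            seen.add(t)
            vals[t] += query(seen) - before  # marginal contribution
    n = factorial(len(tuples))
    return {t: v / n for t, v in vals.items()}

# Hypothetical query: holds iff t3 is present, or both t1 and t2 are.
q = lambda s: ("t3" in s) or ({"t1", "t2"} <= s)
print(shapley(["t1", "t2", "t3"], q))  # t3 scores highest (2/3)
```

Here t3 alone suffices to make the query hold, so it receives the largest share of the credit (2/3), while t1 and t2 split the rest (1/6 each).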
Implementations, Benchmarks and Experiments: for each of the contributions we have not only developed novel algorithms but also implemented them in prototypes, and used these prototypes to experimentally evaluate the performance of our solutions on standard benchmarks. Based on these, we have developed a provenance benchmark whose details are described in the SIGMOD ‘22 and SIGMOD ‘24 papers mentioned above. Each of the aforementioned papers includes an extensive experimental evaluation of our solutions.