Foundations of Factorized Data Management Systems

Periodic Reporting for period 4 - FADAMS (Foundations of Factorized Data Management Systems)

Reporting period: 2020-05-01 to 2022-05-31

This project investigates the performance of a new wave of data management systems that integrate analytics and query processing. It takes a first-principles approach to machine learning over relational databases that advances recent developments in database systems and theory and is centred around factorization techniques, which reduce redundancy both in the representation of relational data and in the computation over it. It also investigates the interaction of factorization with approximation and data updates.

Providing scalable solutions for analytics over relational databases is a defining challenge of our time. While this has always been a topic of research and commercial interest, three major trends drive its current widespread adoption. It is much cheaper to generate data, due to inexpensive sensors, smart devices, social software, and the Internet of Things. It is much cheaper to process data, due to advances in multicore CPUs, inexpensive cloud computing, and open-source software. Finally, our society has become increasingly computational, with many categories of people involved in generating, processing, and consuming data: decision makers, domain scientists, application users, journalists, crowd workers, and everyday consumers. This wide adoption comes with pressing demands for scalable, large-scale data management from businesses, governments, academic disciplines, engineering, communities, and individuals.

Mainstream solutions for analytics over databases currently represent a highly lucrative business. The input to learning classification and regression models is defined by feature extraction queries. The mainstream approach is to materialize the training dataset defined by these queries, export it out of the database, and learn over it using statistical software packages. These steps are expensive and unnecessary. Instead, this project casts the machine learning problem as a database problem. It merges the data-intensive computation into the feature extraction queries and solves them using factorization techniques that exploit the algebraic and combinatorial structure of the workload. This leads to lower computational complexity and to runtimes that can beat even the mere materialization of the training dataset.
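To make the idea concrete, the following minimal sketch (the toy relations, attribute names, and the SUM(B*C) aggregate are illustrative, not the project's code) computes an aggregate over a join by pushing partial sums past the join, so its cost is linear in the input relations rather than in the size of the join result:

from collections import defaultdict

R = [("a1", 2.0), ("a1", 3.0), ("a2", 1.0)]    # R(A, B)
S = [("a1", 10.0), ("a1", 20.0), ("a2", 5.0)]  # S(A, C)

def sum_bc_materialized(R, S):
    # Baseline: enumerate the join R(A, B) JOIN S(A, C), then aggregate.
    return sum(b * c for (a1, b) in R for (a2, c) in S if a1 == a2)

def sum_bc_factorized(R, S):
    # Factorized: aggregate B per A-value and C per A-value, then combine.
    # The cost is linear in |R| + |S| instead of in the join size.
    sum_b, sum_c = defaultdict(float), defaultdict(float)
    for a, b in R:
        sum_b[a] += b
    for a, c in S:
        sum_c[a] += c
    return sum(sum_b[a] * sum_c[a] for a in sum_b if a in sum_c)

assert sum_bc_materialized(R, S) == sum_bc_factorized(R, S)

Aggregates of this shape, sums of products of attribute values, are the building blocks of the gradient and covariance computations behind models such as linear regression, which is why the same rewriting pays off when learning over joins.
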
On the theoretical foundations, we addressed the problems of covers of query results of worst-case optimal size, factorization synthesis for conjunctive queries with negated sparse relations, worst-case optimal evaluation of queries with intersection joins, worst-case optimal dynamic evaluation of triangle and hierarchical queries, and training machine learning models over databases. Our system efforts addressed the design and implementation of prototypes for training a variety of models over databases, for matrix decomposition over databases, and for maintaining models under data updates. We released three prototypes in the public domain: F-BENCH for factorized static query evaluation; F-IVM for factorized dynamic query evaluation; and LMFAO for training models over databases. These works are overviewed in conference tutorials and keynotes.

Our key contributions are on factorized static and dynamic relational query processing and learning. We establish fundamental results grouped in four topics.

We pinpoint the computational complexity of learning models over relational data. We show how to exploit join dependencies and reparameterize models under functional dependencies.
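As a sketch of the reparameterization idea (the feature names, one-hot encodings, and the functional dependency city → country are illustrative), suppose the dependency implies x_country = R x_city for a fixed 0/1 matrix R; then, for a linear model,

\[
\langle \theta_{\mathrm{city}}, x_{\mathrm{city}} \rangle + \langle \theta_{\mathrm{country}}, x_{\mathrm{country}} \rangle
= \langle \theta_{\mathrm{city}} + R^{\top} \theta_{\mathrm{country}},\ x_{\mathrm{city}} \rangle
=: \langle \gamma, x_{\mathrm{city}} \rangle,
\]

so training can be carried out over the smaller parameter \gamma alone, without ever computing aggregates over the functionally determined feature; a ridge penalty on the original parameters can likewise be rewritten as an equivalent penalty on \gamma.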

We show how to compute the QR decomposition of a matrix defined by acyclic joins in time proportional to the size of the input database and independent of the size of the join output.

We propose a dynamic computation approach that unifies the maintenance of regression models, matrix chain products, and query results under data updates. We give the first worst-case optimal fully dynamic evaluation algorithm for triangle queries and pinpoint the trade-off between preprocessing time, update time, and enumeration delay for hierarchical queries.
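As a minimal sketch of the dynamic setting (the relation names, dictionary indexes, and triangle query below are illustrative), the following code maintains a triangle count by evaluating a first-order delta query per single-tuple insert; this naive delta takes time linear in the database, whereas the project's worst-case optimal algorithm partitions data into heavy and light parts to bring the amortized update time down to sublinear in the database size.

from collections import defaultdict

R = defaultdict(set)   # R[a] = set of b with (a, b) in R
S = defaultdict(set)   # S[b] = set of c with (b, c) in S
T = defaultdict(set)   # T[c] = set of a with (c, a) in T
triangle_count = 0     # number of (a, b, c) with R(a, b), S(b, c), T(c, a)

def insert_R(a, b):
    """Apply a single-tuple insert into R and refresh the maintained count."""
    global triangle_count
    if b in R[a]:
        return
    # Delta query: every new triangle uses the inserted tuple (a, b) together
    # with existing tuples (b, c) in S and (c, a) in T.
    triangle_count += sum(1 for c in S[b] if a in T[c])
    R[a].add(b)

Symmetric procedures handle inserts into S and T, and deletes subtract the same delta.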

We establish the computational complexity of Boolean conjunctive queries with intersection joins.

A sample of our key contributions includes: factorized relational learning, fully dynamic query evaluation in worst-case optimal time, and worst-case optimal evaluation of intersection joins.

Our work on factorized learning has attracted strong interest from industry and extensive attention from academia. This learning paradigm has been recognized as key to the relevance of database research in the current landscape dominated by machine learning. The work has been invited for plenary keynote and tutorial presentations at international venues: Highlights of Logic, Games and Automata 2018; Conférence sur la Gestion de Données – Principes, Technologies et Applications 2018; International Conference on Database Theory and Extending Database Technology 2019; International Conference on Scalable Uncertainty Management 2019; FG DB Symposium ML for Systems and Systems for ML 2020; and Very Large Data Bases 2020. This is complemented by many invited talks at academic institutions, workshops, Dagstuhl seminars, and industry labs. Invited PhD-level courses on factorized computation were given at: PhD Open, University of Warsaw (2018); Turing Data Science, The Alan Turing Institute (2018); and the LogiCS/RiSE Summer School, TU Vienna (2017). This work also forms the basis for a new graduate course at the University of Zurich starting in 2021.

Our work on fully dynamic query evaluation in worst-case optimal time opens up a new line of research on charting the exact complexity of dynamic evaluation for relational queries and on their trade-offs between preprocessing time, update time, and enumeration delay. It led to follow-up works in the database theory and algorithms communities.

Our work establishes the computational complexity of queries with intersection joins, which was a long-standing open problem in spatio-temporal data management. It relies on novel techniques that reduce intersection joins to equality joins and that show promise for a wide range of queries beyond intersection, such as those with membership and inequality joins.
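To convey the flavour of the reduction, the following minimal sketch (a standard dyadic-decomposition construction over an integer domain of size 2^4, not the paper's exact encoding) rewrites a point-in-interval test as an equality test on dyadic-interval identifiers. Since two intervals intersect exactly when one contains the left endpoint of the other, an intersection join decomposes into such point-in-interval joins, each of which then becomes an equality join on the identifiers below.

def dyadic_cover(lo, hi, level=0, start=0, size=2**4):
    """Canonical dyadic intervals covering [lo, hi) within [start, start + size)."""
    if hi <= start or start + size <= lo:
        return []
    if lo <= start and start + size <= hi:
        return [(level, start)]          # identifier of one canonical dyadic piece
    half = size // 2
    return (dyadic_cover(lo, hi, level + 1, start, half) +
            dyadic_cover(lo, hi, level + 1, start + half, half))

def dyadic_ancestors(p, levels=5, size=2**4):
    """Identifiers of all dyadic intervals containing point p, one per level."""
    return [(level, (p // (size >> level)) * (size >> level))
            for level in range(levels)]

# Point p lies in [lo, hi) iff the two identifier sets share an element,
# i.e. an equality join on the (level, start) identifiers succeeds.
def stabs(p, lo, hi):
    return bool(set(dyadic_ancestors(p)) & set(dyadic_cover(lo, hi)))

assert stabs(5, 3, 9) and not stabs(10, 3, 9)

Each interval is covered by O(log n) canonical pieces and each point lies in O(log n) dyadic intervals, so the rewriting enlarges the input by only a logarithmic factor.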

The work that put forward factorized databases and lies at the foundation of FADAMS received the International Conference on Database Theory (ICDT) 2022 Test-of-Time Award, which recognizes a paper presented 10 years prior at ICDT that has best met the "test of time" and had the highest impact in terms of research, methodology, conceptual contribution, or transfer to practice over the past decade. The Factorized Databases workshop (Zurich, 2022) showcases 18 research projects from the international database research and industry communities that draw on factorized database principles. These projects cover graph database systems; enumeration in static and dynamic query evaluation; query provenance management; and factorized data analytics, including factorized in-database machine learning. The work on worst-case optimal maintenance for counting triangles received the ICDT'19 best paper award. Extended versions of our PODS'18, PODS'19, and ICDT'19 papers appeared in special issues of the premier ACM TODS journal dedicated to the best papers at these conferences. A FADAMS group member received an Honourable Mention for the 2021 SIGMOD Jim Gray Doctoral Dissertation Award and was Highly Commended in the UK-wide competition for the CPHC/BCS 2020 Distinguished Dissertation Award. Four students received prizes for the best thesis at Oxford for their work on FADAMS.