
The Computational Database for Real World Awareness

Periodic Reporting for period 3 - CompDB (The Computational Database for Real World Awareness)

Reporting period: 2020-06-01 to 2021-11-30

Two major hardware trends have a significant impact on the architecture of database management systems (DBMSs). First, main memory sizes continue to grow significantly: machines with 1 TB of main memory and more are readily available at a relatively low price. Second, the number of cores in a system continues to grow, from 60 or more today to hundreds in the near future. These trends offer radically new opportunities for both business and science. They promise information-at-your-fingertips, i.e. large volumes of data can be analyzed and deeply explored online, in parallel to regular transaction processing. Currently, deep data exploration is performed outside of the database system, which necessitates huge data transfers and slows processing to the point that real-time interactive exploration is impossible. The new hardware capabilities now make it possible to build a true computational database system that integrates deep exploration functionality at the source of the data. This will lead to a drastic shift in how users interact with data, as for the first time interactive data exploration becomes possible at a massive scale.
Within the reporting period we made great progress on our new system architecture, addressing many challenges in efficient compilation and language integration. One of the goals of this project is the seamless integration of high-level data processing, specified in a programming language, with traditional database query support. This poses many technical challenges, including, somewhat surprisingly, compile time: when a complex algorithm is specified and later executed on a very efficient parallel execution engine, the compile time can be higher than the actual execution time. This turned out to be problematic for interactive use cases, so we developed a new compilation framework that adaptively compiles the different parts of the execution plan depending upon usage. The code is initially compiled using a very cheap compiler that is optimized for compile time and uses a new linear-time register allocator. More expensive compilation modes are then used to improve the initial code only when the observed execution times and the cost model predict expensive compilation to be beneficial. This allows for very efficient execution of “cheap” queries (i.e. queries that might be structurally complex, but that touch comparatively little data), while complex analytical queries still benefit from the full power of an optimizing compiler backend. Extensive work on algebraic optimization led to an improved query optimization component, which is essential for handling large and complex analytical queries; previous approaches were unable to find solutions for large queries. Our optimization framework can now handle all classes of queries, including queries with cross products and hyper-edges, which is important for handling arbitrary analytical queries.
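The adaptive compilation decision described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the project's implementation: the two modes, their compile times and speedups, the morsel-wise loop, and the names `Mode`, `better_mode`, and `run` are all hypothetical. It also compiles synchronously for simplicity, whereas a real engine might compile in the background.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Mode:
    name: str
    compile_seconds: float  # hypothetical one-off compilation cost
    speedup: float          # hypothetical throughput relative to the cheapest mode


# Illustrative numbers only; they are not measurements from the project.
MODES = [
    Mode("cheap", 0.001, 1.0),
    Mode("optimized", 0.050, 4.0),
]


def better_mode(current: Mode, rows_left: int,
                observed_rows_per_sec: float) -> Optional[Mode]:
    """Return a mode worth switching to, or None to keep the current code."""
    # Estimated time to finish the remaining work without recompiling.
    best, best_time = None, rows_left / observed_rows_per_sec
    # Throughput the cheapest mode would reach, derived from the observation.
    base_rate = observed_rows_per_sec / current.speedup
    for mode in MODES:
        if mode.speedup <= current.speedup:
            continue
        # Cost model: pay the compile time, then run faster on the rest.
        estimate = mode.compile_seconds + rows_left / (base_rate * mode.speedup)
        if estimate < best_time:
            best, best_time = mode, estimate
    return best


def run(total_rows: int, morsel_rows: int,
        cheap_rows_per_sec: float) -> Tuple[List[str], float]:
    """Simulate chunk-wise execution; returns (modes used, simulated seconds)."""
    mode = MODES[0]
    elapsed = mode.compile_seconds  # always start with the cheap compiler
    used = [mode.name]
    done = 0
    while done < total_rows:
        n = min(morsel_rows, total_rows - done)
        rate = cheap_rows_per_sec * mode.speedup
        elapsed += n / rate
        done += n
        # Re-evaluate after every chunk using the observed throughput.
        upgrade = better_mode(mode, total_rows - done, rate)
        if upgrade is not None:
            elapsed += upgrade.compile_seconds
            mode = upgrade
            used.append(mode.name)
    return used, elapsed
```

With these assumed numbers, `run(1_000, 1_000, 1e6)` finishes in the cheap mode because no work remains to amortize a recompilation, while `run(10_000_000, 10_000, 1e6)` switches to the optimized mode after the first chunk, since the predicted saving far exceeds the assumed compile cost.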
Furthermore, we worked on integrating user-defined operators into the query execution workflow, which will be used as a building block for executing high-level execution logic.
The compilation and optimization work significantly advanced the state of the art and was accordingly published in top venues (SIGMOD and ICDE), including a best paper award for the compilation work. Together with other compilation techniques, we are now able to assemble very complex analytical queries, including window functions and complex aggregations, from low-level primitives, leading to very fast execution plans.
We are currently in the process of publishing work on complex analytical processing using user-defined logic, and we are extending the work on distributed processing, including cloud scenarios.
Adaptive compilation strategies across data sizes