Periodic Reporting for period 1 - CODAC (Commoditizing Data Analytics in the Cloud)
Reporting period: 2023-02-01 to 2025-07-31
The CODAC system consists of three main components: First, an intelligent control component that automatically and transparently selects and manages the cheapest hardware instances for the given workload and enables migration to other (e.g., European) cloud vendors. Second, a highly efficient and scalable query processing engine capable of fully exploiting modern cloud hardware. Third, a data lake storage abstraction based on open data formats that enables cheap storage as well as modularity and interoperability across different data systems. The resulting system therefore has the potential to make large-scale data analytics both cheaper and easier.
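To illustrate the role of the control component, the following C++ sketch shows how an instance selector might compare price offers across vendors. All names (InstanceOffer, selectCheapest) are hypothetical and do not reflect CODAC's actual interfaces; this is a minimal sketch of the idea, not the implementation.

    // Hedged sketch of the control component's instance selection;
    // all names are illustrative, not CODAC's actual interfaces.
    #include <algorithm>
    #include <string>
    #include <vector>

    struct InstanceOffer {
        std::string vendor;        // e.g. "aws", "gcp", or a European vendor
        std::string instanceType;  // e.g. "c6i.4xlarge"
        double pricePerHour;       // current (spot) price
    };

    // Transparently pick the cheapest offer for the workload (assumes a
    // non-empty offer list). Migration to another vendor then amounts to
    // re-running the selection over that vendor's offers.
    InstanceOffer selectCheapest(const std::vector<InstanceOffer>& offers) {
        return *std::min_element(offers.begin(), offers.end(),
            [](const InstanceOffer& a, const InstanceOffer& b) {
                return a.pricePerHour < b.pricePerHour;
            });
    }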
We developed BtrBlocks, an open-source columnar compression format optimized for cloud data lakes. BtrBlocks significantly increases decompression speed compared to current solutions while maintaining excellent compression ratios. Its modular design allows future enhancements with new encoding schemes, enabling further reductions in cloud storage and processing costs.
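As a rough illustration of how such a format stays modular, the sketch below chooses the cheapest encoding per column chunk from a small candidate set. The encoding set, size heuristics, and all names are simplified assumptions for this summary and do not reflect BtrBlocks' actual encodings or API.

    // Hedged sketch of per-chunk encoding selection in the spirit of a
    // modular columnar format; simplified, not the BtrBlocks API.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <vector>

    enum class Encoding { Raw, RLE, Dictionary, FrameOfReference };

    // Rough output-size estimates (bytes) for one integer chunk.
    static size_t estimateSize(const std::vector<int32_t>& chunk, Encoding e) {
        if (chunk.empty()) return 0;
        switch (e) {
            case Encoding::Raw:
                return chunk.size() * sizeof(int32_t);
            case Encoding::RLE: {
                size_t runs = 1;
                for (size_t i = 1; i < chunk.size(); ++i)
                    if (chunk[i] != chunk[i - 1]) ++runs;
                return runs * (sizeof(int32_t) + sizeof(uint32_t)); // value + run length
            }
            case Encoding::Dictionary: {
                std::set<int32_t> distinct(chunk.begin(), chunk.end());
                // Dictionary entries + one-byte codes (assumes <= 256 distinct values).
                return distinct.size() * sizeof(int32_t) + chunk.size();
            }
            case Encoding::FrameOfReference: {
                auto [mn, mx] = std::minmax_element(chunk.begin(), chunk.end());
                int64_t range = int64_t(*mx) - int64_t(*mn);
                size_t bytesPerValue = range < 256 ? 1 : range < 65536 ? 2 : 4;
                return sizeof(int32_t) + chunk.size() * bytesPerValue; // base + narrow deltas
            }
        }
        return chunk.size() * sizeof(int32_t);
    }

    // Pick the encoding with the smallest estimated output for this chunk.
    // New schemes can be added to the candidate list without touching callers.
    Encoding chooseEncoding(const std::vector<int32_t>& chunk) {
        Encoding best = Encoding::Raw;
        size_t bestSize = estimateSize(chunk, Encoding::Raw);
        for (Encoding e : {Encoding::RLE, Encoding::Dictionary,
                           Encoding::FrameOfReference}) {
            size_t s = estimateSize(chunk, e);
            if (s < bestSize) { bestSize = s; best = e; }
        }
        return best;
    }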
We introduced Umami, a query processing framework enabling seamless transitions between in-memory and out-of-memory workloads without performance loss. Spilly, a prototype query processing engine implementing Umami, efficiently leverages modern NVMe SSDs and achieves performance close to that of purely in-memory systems. Unlike other query engines, Umami fully utilizes the high bandwidth of NVMe SSDs, resulting in a significant cost reduction. Ongoing work extends Umami to distributed processing for enhanced scalability.
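The following toy C++ sketch conveys the basic spilling idea: keep data in RAM up to a memory budget and write overflow to fast NVMe storage. It is a minimal illustration under those assumptions, not Umami's or Spilly's actual design, which relies on techniques such as asynchronous I/O to saturate SSD bandwidth.

    // Hedged sketch of spilling: hot data stays in RAM within a budget,
    // overflow goes to SSD-backed temporary storage. Toy illustration only.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    class SpillableBuffer {
        size_t memoryBudgetBytes;
        std::vector<int64_t> inMemory;   // hot portion, kept in RAM
        std::FILE* spillFile = nullptr;  // cold portion; place on NVMe in practice

    public:
        explicit SpillableBuffer(size_t budget) : memoryBudgetBytes(budget) {}

        void append(int64_t value) {
            if ((inMemory.size() + 1) * sizeof(int64_t) <= memoryBudgetBytes) {
                inMemory.push_back(value);  // still fits within the memory budget
            } else {
                if (!spillFile) spillFile = std::tmpfile();
                std::fwrite(&value, sizeof(value), 1, spillFile);  // spill to SSD
            }
        }

        ~SpillableBuffer() { if (spillFile) std::fclose(spillFile); }
    };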
We also created SaneQL, a simplified and expressive query language designed to overcome SQL's complexity, making queries clearer and easier to write. SaneQL serves as a stepping stone toward SaneIR, a unified standard for relational data semantics. This unification of semantics supports interoperability between database components such as languages, optimizers, query engines, and storage systems. Enhanced interoperability reduces vendor lock-in and gives users greater flexibility.
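For illustration, the sketch below shows a small aggregation first in SQL and then as a SaneQL-style pipeline. The syntax is approximated for this summary and may not match the current SaneQL specification verbatim.

    -- SQL:
    --   SELECT l_returnflag, SUM(l_quantity) AS sum_qty
    --   FROM lineitem
    --   WHERE l_shipdate <= DATE '1998-09-02'
    --   GROUP BY l_returnflag

    -- SaneQL-style pipeline (approximated syntax):
    lineitem
      .filter(l_shipdate <= '1998-09-02')
      .groupby({l_returnflag}, {sum_qty := sum(l_quantity)})

Reading a query as a linear pipeline of table transformations, rather than as a nested SELECT, is what makes queries easier to write and compose.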
SaneIR has considerable potential impact, as it addresses the significant challenges posed by inconsistent query processing semantics across different database systems. Currently, this inconsistency hampers interoperability and reinforces vendor lock-in, as different systems may produce different results for the same query. Adoption of SaneIR would enable a more interoperable, modular, and competitive data processing ecosystem, benefiting users through increased flexibility, reduced costs, and enhanced innovation in data-driven technologies.
To realize our vision of seamless, cost-effective scalability without sacrificing performance, further research and development are required. Moreover, efficient data migration solutions between major cloud platforms such as AWS, Azure, and Google Cloud must be developed to overcome vendor-specific limitations and costs. Creating integration tools and specialized "glue code" tailored to the various cloud providers will be crucial for practical deployment and commercial viability. As for adoption, building user trust remains a major challenge, especially given the sensitive nature of the data handled: cloud providers offering data processing systems have had years to build trust and are backed by well-known companies. To overcome this challenge, we plan to rely on open-source licensing, independent verification, and open communication. Finally, addressing compliance with EU regulatory frameworks will ensure legal and operational compatibility for large-scale data processing.