Periodic Reporting for period 1 - CODAC (Commoditizing Data Analytics in the Cloud)
Reporting period: 2023-02-01 to 2025-07-31
The CODAC system consists of three main components: First, an intelligent control component that automatically and transparently selects and manages the cheapest hardware instances for the given workload and enables migration to other (e.g., European) cloud vendors. Second, a highly efficient and scalable query processing engine capable of fully exploiting modern cloud hardware. Third, a data lake storage abstraction based on open data formats that enables cheap storage as well as modularity and interoperability across different data systems. The resulting system therefore has the potential to make large-scale data analytics both cheaper and easier.
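To illustrate the role of the control component, the following C++ sketch shows how an instance selector might compare price offers across vendors. All names (InstanceOffer, selectCheapest) are hypothetical and do not reflect CODAC's actual interfaces; this is a minimal sketch of the idea, not the implementation.

    // Hedged sketch of the control component's instance selection;
    // all names are illustrative, not CODAC's actual interfaces.
    #include <algorithm>
    #include <string>
    #include <vector>

    struct InstanceOffer {
        std::string vendor;        // e.g. "aws", "gcp", or a European vendor
        std::string instanceType;  // e.g. "c6i.4xlarge"
        double pricePerHour;       // current (spot) price
    };

    // Transparently pick the cheapest offer for the workload (assumes a
    // non-empty offer list). Migration to another vendor then amounts to
    // re-running the selection over that vendor's offers.
    InstanceOffer selectCheapest(const std::vector<InstanceOffer>& offers) {
        return *std::min_element(offers.begin(), offers.end(),
            [](const InstanceOffer& a, const InstanceOffer& b) {
                return a.pricePerHour < b.pricePerHour;
            });
    }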
We developed BtrBlocks, an open-source columnar compression format optimized for cloud data lakes. BtrBlocks significantly increases decompression speed compared to current solutions while maintaining excellent compression ratios. Its modular design allows future enhancements with new encoding schemes, enabling further reductions in cloud storage and processing costs.
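As a rough illustration of how such a format stays modular, the sketch below chooses the cheapest encoding per column chunk from a small candidate set. The encoding set, size heuristics, and all names are simplified assumptions for this summary and do not reflect BtrBlocks' actual encodings or API.

    // Hedged sketch of per-chunk encoding selection in the spirit of a
    // modular columnar format; simplified, not the BtrBlocks API.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <vector>

    enum class Encoding { Raw, RLE, Dictionary, FrameOfReference };

    // Rough output-size estimates (bytes) for one integer chunk.
    static size_t estimateSize(const std::vector<int32_t>& chunk, Encoding e) {
        if (chunk.empty()) return 0;
        switch (e) {
            case Encoding::Raw:
                return chunk.size() * sizeof(int32_t);
            case Encoding::RLE: {
                size_t runs = 1;
                for (size_t i = 1; i < chunk.size(); ++i)
                    if (chunk[i] != chunk[i - 1]) ++runs;
                return runs * (sizeof(int32_t) + sizeof(uint32_t)); // value + run length
            }
            case Encoding::Dictionary: {
                std::set<int32_t> distinct(chunk.begin(), chunk.end());
                // Dictionary entries + one-byte codes (assumes <= 256 distinct values).
                return distinct.size() * sizeof(int32_t) + chunk.size();
            }
            case Encoding::FrameOfReference: {
                auto [mn, mx] = std::minmax_element(chunk.begin(), chunk.end());
                int64_t range = int64_t(*mx) - int64_t(*mn);
                size_t bytesPerValue = range < 256 ? 1 : range < 65536 ? 2 : 4;
                return sizeof(int32_t) + chunk.size() * bytesPerValue; // base + narrow deltas
            }
        }
        return chunk.size() * sizeof(int32_t);
    }

    // Pick the encoding with the smallest estimated output for this chunk.
    // New schemes can be added to the candidate list without touching callers.
    Encoding chooseEncoding(const std::vector<int32_t>& chunk) {
        Encoding best = Encoding::Raw;
        size_t bestSize = estimateSize(chunk, Encoding::Raw);
        for (Encoding e : {Encoding::RLE, Encoding::Dictionary,
                           Encoding::FrameOfReference}) {
            size_t s = estimateSize(chunk, e);
            if (s < bestSize) { bestSize = s; best = e; }
        }
        return best;
    }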
We introduced Umami, a query processing framework enabling seamless transitions between in-memory and out-of-memory workloads without performance loss. Spilly, a prototype query processing engine implementing Umami, efficiently leverages modern NVMe SSDs and achieves performance close to that of purely in-memory systems. Unlike other query engines, Umami fully utilizes the high bandwidth of NVMe SSDs, resulting in a significant cost reduction. Ongoing work extends Umami to distributed processing for enhanced scalability.
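The following toy C++ sketch conveys the basic spilling idea: keep data in RAM up to a memory budget and write overflow to fast NVMe storage. It is a minimal illustration under those assumptions, not Umami's or Spilly's actual design, which relies on techniques such as asynchronous I/O to saturate SSD bandwidth.

    // Hedged sketch of spilling: hot data stays in RAM within a budget,
    // overflow goes to SSD-backed temporary storage. Toy illustration only.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    class SpillableBuffer {
        size_t memoryBudgetBytes;
        std::vector<int64_t> inMemory;   // hot portion, kept in RAM
        std::FILE* spillFile = nullptr;  // cold portion; place on NVMe in practice

    public:
        explicit SpillableBuffer(size_t budget) : memoryBudgetBytes(budget) {}

        void append(int64_t value) {
            if ((inMemory.size() + 1) * sizeof(int64_t) <= memoryBudgetBytes) {
                inMemory.push_back(value);  // still fits within the memory budget
            } else {
                if (!spillFile) spillFile = std::tmpfile();
                std::fwrite(&value, sizeof(value), 1, spillFile);  // spill to SSD
            }
        }

        ~SpillableBuffer() { if (spillFile) std::fclose(spillFile); }
    };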
We also created SaneQL, a simplified and expressive query language designed to overcome SQL's complexity, making queries clearer and easier to write. SaneQL serves as a stepping stone toward SaneIR, a unified standard for relational data semantics. This unification of semantics supports interoperability between database components such as languages, optimizers, query engines, and storage systems. Enhanced interoperability reduces vendor lock-in and gives users greater flexibility.
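For illustration, the sketch below shows a small aggregation first in SQL and then as a SaneQL-style pipeline. The syntax is approximated for this summary and may not match the current SaneQL specification verbatim.

    -- SQL:
    --   SELECT l_returnflag, SUM(l_quantity) AS sum_qty
    --   FROM lineitem
    --   WHERE l_shipdate <= DATE '1998-09-02'
    --   GROUP BY l_returnflag

    -- SaneQL-style pipeline (approximated syntax):
    lineitem
      .filter(l_shipdate <= '1998-09-02')
      .groupby({l_returnflag}, {sum_qty := sum(l_quantity)})

Reading a query as a linear pipeline of table transformations, rather than as a nested SELECT, is what makes queries easier to write and compose.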
SaneIR has considerable potential impact, as it addresses the significant challenges posed by inconsistent query processing semantics across different database systems. Currently, this inconsistency hampers interoperability and reinforces vendor lock-in, as different systems may produce different results for the same query. Adoption of SaneIR would enable a more interoperable, modular, and competitive data processing ecosystem, benefiting users through increased flexibility, reduced costs, and enhanced innovation in data-driven technologies.
To realize our vision of seamless, cost-effective scalability without sacrificing performance, further research and development are required. Moreover, efficient data migration solutions between major cloud platforms such as AWS, Azure, and Google Cloud must be developed to overcome vendor-specific limitations and costs. Creating integration tools and specialized "glue code" tailored to the various cloud providers will be crucial for practical deployment and commercial viability. As for adoption, building user trust remains a major challenge, especially given the sensitive nature of the data handled: cloud providers offering data processing systems have had years to build trust and are backed by well-known companies. To overcome this challenge, we plan to rely on open-source licensing, independent verification, and open communication. Finally, addressing compliance with EU regulatory frameworks will ensure legal and operational compatibility for large-scale data processing.