Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning

Periodic Reporting for period 2 - DAPHNE (Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning)

Reporting period: 2022-06-01 to 2023-11-30

The overall objective of DAPHNE is to define and build an open and extensible system infrastructure for integrated data analysis (IDA) pipelines. Such pipelines are complex workflows of data management and query processing, high-performance computing (HPC) and numerical simulations, as well as training and scoring of multiple machine learning (ML) models. Developing and deploying such IDA pipelines is still a painful process involving different systems and libraries, data exchange between these systems, inter-disciplinary development teams, and different programming models and resource managers, all of which causes spatio-temporal underutilization of cluster hardware. Interestingly, data management, ML, and HPC share many compilation and runtime techniques, stress all aspects of the underlying hardware (HW), and thus are affected by emerging HW challenges such as scaling limitations. These HW challenges lead to increasing specialization such as different data representations, HW accelerators, data placements, and specialized data types. Given this specialization, it becomes untenable to tune complex IDA pipelines for heterogeneous hardware. DAPHNE use cases include earth observation, semiconductor manufacturing, and automotive vehicle development, but there exists a wide variety of use cases that rely on ML-assisted simulations, data cleaning and augmentation, and exploratory query processing. In order to better support such use cases of IDA pipelines, DAPHNE’s strategic objectives include (1) a system architecture, APIs and DSLs (for developing such pipelines with seamless integration of existing systems and extensibility), (2) hierarchical scheduling and task planning for improved utilization of heterogeneous HW, as well as (3) evaluating this infrastructure on real-world use cases and benchmarks. Our efforts addressing these objectives are balanced across the open-source development of the DAPHNE system, selected foundational research projects for later integration, and a continuous refinement of the use case implementations and benchmarks.
In line with the annual report D1.5 and the Periodic Report for project period 2 (until 11/2023), we report below on the work carried out across the work packages, which was conducted in close collaboration.

Project Management / Dissemination (WP 1 and 10): Besides regular all-hands meetings in the first 36 months, comprising project tracking according to the project plan and monthly communication of project news, the DAPHNE cloud, hosted by Know-Center, has been maintained to ensure smooth sharing of project files, materials, and documents. The specific project management tasks carried out over the project runtime are documented in the project and risk management plan (D1.1), the research data management plan (D1.2), the first annual report (D1.3), the second annual report (D1.4), and, up to project month 36 (November 2023), the third annual report (D1.5).
System Architecture and DSL (WP 2 and 3): Since February 2021, we have actively developed a prototype of the DAPHNE system, which was shared as a demonstrator in D3.2 and, as of March 31, 2022, has been migrated to a public OSS repository under the Apache 2 license. After elaborating on the refined system architecture in D2.1 and D2.2, the compiler designs in D3.1, and the initial prototype in D3.2, we have developed the extended DAPHNE compiler prototype, which is based on MLIR and uses it as a library of compiler infrastructure to facilitate cost-effective development of our domain-specific language, reuse of compiler infrastructure, and extensibility.
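To illustrate the general idea behind such multi-level compilation (this is a purely conceptual Python toy with hypothetical names, not DAPHNE's actual MLIR dialects or passes), a high-level operation can be represented symbolically and then rewritten by a lowering pass into a tiled form that a runtime could execute:

```python
# Toy illustration of multi-level lowering (hypothetical, not DAPHNE's MLIR dialects).
from dataclasses import dataclass

@dataclass
class MatMul:
    """High-level 'linear algebra'-style op: C = A @ B."""
    m: int
    n: int
    k: int

@dataclass
class TiledLoopNest:
    """Lower-level representation produced by one lowering pass."""
    tiles_m: int
    tiles_n: int
    tile_size: int
    body: str  # description of the per-tile kernel to invoke

def lower_to_tiles(op: MatMul, tile_size: int = 256) -> TiledLoopNest:
    """One 'pass': rewrite a high-level op into a tiled loop nest."""
    tiles_m = -(-op.m // tile_size)  # ceiling division
    tiles_n = -(-op.n // tile_size)
    return TiledLoopNest(tiles_m, tiles_n, tile_size,
                         body=f"matmul_tile(k={op.k})")

if __name__ == "__main__":
    high_level = MatMul(m=10_000, n=4_096, k=512)
    print(lower_to_tiles(high_level))  # TiledLoopNest(tiles_m=40, tiles_n=16, ...)
```

In the real system, reusing MLIR means such rewrites are expressed as passes over shared IR abstractions instead of hand-written translation code, which is what makes the approach cost-effective and extensible.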

Runtime and Scheduling (WP 4 and 5): Discussions in WP 4 and 5 combined knowledge sharing of selected techniques with in-depth discussions of runtime aspects of the prototype and its extensions. Initial efforts centered around the core data structures and kernels. We introduced a vectorized (tiled) execution engine that processes operator pipelines in a task-based manner on tiles of the inputs. This design is described in the system architecture (D2.1), the language abstractions (D3.1), the DSL runtime design (D4.1), and the scheduler design (D5.1). Beyond the local runtime, we also created an initial distributed runtime system, which uses hierarchical vectorized pipelines. Additional work investigated distribution primitives, collective operations (e.g. MPI), parameter servers, and distribution strategies. For hierarchical scheduling, we have analyzed requirements and explored various task scheduling strategies.
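The following minimal Python sketch illustrates the general principle of vectorized (tiled), task-based execution; the function names and tiling parameters are hypothetical and it is not the DAPHNE runtime API. An operator pipeline is fused into a single per-tile function and applied to row tiles of the input by a pool of worker tasks:

```python
# Minimal sketch of task-based, tiled pipeline execution (not the DAPHNE runtime API).
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def pipeline(tile: np.ndarray) -> np.ndarray:
    """A fused operator pipeline applied to one tile: scale, clip, then row sums."""
    scaled = tile * 2.0
    clipped = np.clip(scaled, 0.0, 1.0)
    return clipped.sum(axis=1)

def tiled_execute(X: np.ndarray, tile_rows: int = 1024, workers: int = 4) -> np.ndarray:
    """Split X into row tiles, process each tile as an independent task, and concatenate."""
    tiles = [X[i:i + tile_rows] for i in range(0, X.shape[0], tile_rows)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(pipeline, tiles))  # one task per tile
    return np.concatenate(results)

if __name__ == "__main__":
    X = np.random.rand(10_000, 16)
    print(tiled_execute(X).shape)  # (10000,)
```

Processing pipelines on tiles rather than materializing each intermediate in full is what allows the scheduler to balance load across heterogeneous workers and, in the distributed case, to nest such pipelines hierarchically.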

Computational Storage and HW Accelerators (WP 6 and 7): Work packages 6 and 7 also have natural synergies. Besides knowledge sharing, initial work in the first 18 months covered basic I/O support for selected data formats, an analysis of the design space and current technology trends in D6.1, as well as an initial integration of GPU and FPGA operations, related data placement primitives, and tailor-made device kernels for selected operations (e.g. FPGA quantization). The integration of GPU and FPGA accelerators is important for the performance of various end-to-end pipelines and serves as an example for integrating other HW accelerators. GPUs (and later FPGAs) are also part of vectorized execution to exploit heterogeneous HW jointly.
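As a conceptual sketch of how device kernels and data placement primitives fit together (the device names, heuristic, and dispatch logic below are hypothetical, not DAPHNE's actual placement primitives), heterogeneous execution can be viewed as a registry mapping an operation and a target device to a specific kernel, combined with a simple placement decision:

```python
# Conceptual sketch of device-aware kernel dispatch (hypothetical, not DAPHNE's API).
import numpy as np

# Registry: (operation name, device) -> kernel implementation.
KERNELS = {
    ("matmul", "cpu"): lambda a, b: a @ b,
    # A GPU/FPGA build would register accelerator kernels here instead.
    ("matmul", "gpu"): lambda a, b: a @ b,  # placeholder standing in for a device kernel
}

def choose_device(rows: int, cols: int, gpu_available: bool) -> str:
    """Toy placement heuristic: offload only sufficiently large inputs."""
    return "gpu" if gpu_available and rows * cols > 1_000_000 else "cpu"

def dispatch(op: str, a: np.ndarray, b: np.ndarray, gpu_available: bool = False):
    """Pick a device for the data, then look up and run the matching kernel."""
    device = choose_device(a.shape[0], a.shape[1], gpu_available)
    kernel = KERNELS[(op, device)]
    return kernel(a, b), device

if __name__ == "__main__":
    a, b = np.random.rand(2000, 1000), np.random.rand(1000, 500)
    result, device = dispatch("matmul", a, b, gpu_available=True)
    print(result.shape, "executed on", device)
```

Separating the placement decision from the kernel implementation is what makes it possible to plug in additional accelerators and tailor-made device kernels without changing the surrounding pipeline.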

Use Cases and Benchmarks (WP 8 and 9): Work packages 8 and 9 conducted regular meetings to discuss the individual use cases, the use case descriptions, and the ML pipeline implementations. A major outcome is the set of use case pipelines in D8.1, which serve as example use cases for the DAPHNE system and as real-world benchmarks. We further surveyed existing benchmarks in databases, data-parallel computation, HPC, and ML systems in D9.1. Additionally, HPI made major contributions to the development of the TPCx-AI benchmark (released in 09/2021), and several partners (HPI, UNIBAS, KNOW) conducted student projects for benchmarking IDA pipelines and additional TPCx-AI implementations. The focus of the third project year has been to bring the bottom-up developed DAPHNE system closer to the top-down developed use cases.
During the first 36 months, we created the initial design and system architecture of the open and extensible DAPHNE system infrastructure based on MLIR as a multi-level intermediate representation. During this time, MLIR has seen broader community adoption for different dialects, optimization passes, and hardware accelerators. However, most projects focus on narrow aspects, not an end-to-end system for IDA pipelines. In contrast, the DAPHNE prototype was made open source in March 2022, and we have made progress according to our artefact release schedule, continuing to build a full system infrastructure. A major focus in the third project year has been the means of extensibility, enabling researchers and developers to quickly explore and experiment while reusing the infrastructure. Additional advancements include selected research projects by the individual partners. Many of these results are continuously integrated back into the DAPHNE system and use cases. This reciprocal process of advanced development and foundational research aims to maximize impact, both scientifically and in practice.
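To give a flavor of what such extensibility can look like in general (this is a generic Python sketch with hypothetical names, not DAPHNE's actual extension interface), a researcher-provided kernel might simply be registered under an operation name so that the rest of the infrastructure picks it up without modification:

```python
# Generic sketch of kernel extensibility (not DAPHNE's actual extension mechanism).
from typing import Callable, Dict
import numpy as np

_REGISTRY: Dict[str, Callable] = {}

def register_kernel(name: str):
    """Decorator that makes a user-provided kernel available under an operation name."""
    def wrap(fn: Callable) -> Callable:
        _REGISTRY[name] = fn
        return fn
    return wrap

@register_kernel("colmean")
def default_colmean(X: np.ndarray) -> np.ndarray:
    """Built-in reference implementation."""
    return X.mean(axis=0)

@register_kernel("colmean_experimental")  # a researcher's drop-in variant
def experimental_colmean(X: np.ndarray) -> np.ndarray:
    return X.sum(axis=0) / X.shape[0]

if __name__ == "__main__":
    X = np.random.rand(100, 8)
    print(np.allclose(_REGISTRY["colmean"](X),
                      _REGISTRY["colmean_experimental"](X)))  # True
```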
Figure: DAPHNE System Architecture