Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning

Periodic Reporting for period 1 - DAPHNE (Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning)

Reporting period: 2020-12-01 to 2022-05-31

The overall objective of DAPHNE is to define and build an open and extensible system infrastructure for integrated data analysis (IDA) pipelines. Such pipelines are complex workflows that combine data management and query processing, high-performance computing (HPC) and numerical simulations, and the training and scoring of multiple machine learning (ML) models. Developing and deploying such IDA pipelines is still a painful process: it involves different systems and libraries, data exchange between these systems, interdisciplinary development teams, and different programming models and resource managers, all of which cause spatial and temporal underutilization of cluster hardware. Interestingly, data management, ML, and HPC share many compilation and runtime techniques and stress all aspects of the underlying hardware (HW); they are therefore all affected by emerging HW challenges such as scaling limitations. These HW challenges lead to increasing specialization, such as different data representations, HW accelerators, data placements, and specialized data types. Given this specialization, it becomes untenable to tune complex IDA pipelines for heterogeneous hardware. DAPHNE use cases include earth observation, semiconductor manufacturing, and automotive vehicle development, but a wide variety of other use cases also rely on ML-assisted simulations, data cleaning and augmentation, and exploratory query processing. To better support such use cases of IDA pipelines, DAPHNE's strategic objectives include (1) a system architecture, APIs, and DSLs for developing such pipelines with seamless integration of existing systems and extensibility, (2) hierarchical scheduling and task planning for improved utilization of heterogeneous HW, and (3) the evaluation of this infrastructure on real-world use cases and benchmarks.
Our efforts addressing these objectives are balanced across the open-source development of the DAPHNE system, selected foundational research projects for later integration, and a continuous refinement of the use case implementations and benchmarks.

As in the annual report (D1.3), we report on the work carried out in pairs of closely collaborating work packages.

Project Management / Dissemination (WP 1 and 10): Besides holding regular all-hands meetings, we focused in the first 18 months on setting up the project infrastructure. Related outcomes are the project and risk management plan (D1.1), the research data management plan (D1.2), and the first annual report (D1.3). Furthermore, we organized the kickoff meeting in 12/2020 and the first general assembly meeting in 10/2021, and we are organizing the first review meeting in 07/2022. The initial website was replaced by a new website with more information, use cases, talks, and publications. Besides papers and talks, we conducted broad dissemination and exploitation activities and refined the dissemination and exploitation plan in D10.1.

System Architecture and DSL (WP 2 and 3): After many discussions, we summarized the requirements for an open and extensible system infrastructure and defined its system architecture and components. This system architecture was documented in D2.1 and in a joint CIDR 2022 paper by all partners. In D3.1, we defined the DAPHNE language abstractions: DaphneDSL (a domain-specific language) and DaphneAPI (a Python API). In this context, we also described the initial design of the MLIR-based optimizing compiler, DaphneIR as the central intermediate representation, and future extensions by higher-level built-ins. Since February 2021, we have been actively developing a prototype of the DAPHNE system, which was shared as a demonstrator in D3.2 and, as of March 31, 2022, has been migrated to a public OSS repository (https://github.com/daphne-eu/daphne) under the Apache 2.0 license.

Runtime and Scheduling (WP 4 and 5): Discussions in WP 4 and 5 combined knowledge sharing on selected techniques with in-depth discussions of runtime aspects of the prototype and its extensions. Initial efforts centered on the core data structures and kernels. We introduced a vectorized (tiled) execution engine that processes operator pipelines in a task-based manner on tiles of inputs. The design is described in the system architecture (D2.1), the language abstractions (D3.1), the DSL runtime design (D4.1), and the scheduler design (D5.1). Beyond the local runtime, we also created an initial distributed runtime system, which uses hierarchical vectorized pipelines. Additional work investigated distribution primitives, collective operations (e.g., MPI), parameter servers, and distribution strategies. For hierarchical scheduling, we analyzed the requirements and explored various task scheduling strategies.
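To illustrate the general idea of tiled, task-based execution of an operator pipeline, the following minimal Python sketch splits an input into tiles and runs the whole pipeline on each tile as one task. It is not taken from the DAPHNE code base: the function names (`make_tiles`, `run_pipeline`, `execute_tiled`), the use of a thread pool, and the two toy operators are assumptions chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tiles(rows, tile_size):
    """Split a list of rows into consecutive tiles of at most tile_size rows."""
    return [rows[i:i + tile_size] for i in range(0, len(rows), tile_size)]


def run_pipeline(tile, operators):
    """Apply every operator of the pipeline to one tile, in order."""
    for op in operators:
        tile = op(tile)
    return tile


def execute_tiled(rows, operators, tile_size=4, workers=2):
    """Run the operator pipeline as one task per tile and concatenate results."""
    tiles = make_tiles(rows, tile_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so tile order is preserved.
        results = pool.map(lambda t: run_pipeline(t, operators), tiles)
        return [x for tile in results for x in tile]


# Hypothetical element-wise operators forming a two-stage pipeline.
def scale(tile):
    return [2 * x for x in tile]


def shift(tile):
    return [x + 1 for x in tile]


result = execute_tiled(list(range(10)), [scale, shift], tile_size=4)
```

Because each task applies all operators to its tile before moving on, intermediate results stay tile-sized instead of being materialized for the full input, which is the usual motivation for this style of execution.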

Computational Storage and HW Accelerators (WP 6 and 7): Work packages 6 and 7 also have natural synergies. Besides knowledge sharing, work in the first 18 months covered basic I/O support for selected data formats, an analysis of the design space and current technology trends in D6.1, and an initial integration of GPU and FPGA operations, related data placement primitives, and tailor-made device kernels for selected operations (e.g., FPGA quantization). The integration of GPU and FPGA accelerators is important for the performance of various end-to-end pipelines and serves as an example for integrating other HW accelerators. GPUs (and later FPGAs) are also part of vectorized execution in order to exploit heterogeneous HW jointly. More specialized work focused on virtual vector abstractions for SIMD, computational storage platforms and initial experiments, the exploration of abstractions for complex storage hierarchies, and performance models.
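As a point of reference for the quantization operation mentioned above, the sketch below shows plain uniform (min-max) quantization to fixed-width integer codes in Python. The report does not specify which quantization scheme the FPGA kernel implements, so the scheme, the function names `quantize` and `dequantize`, and the 8-bit default are all assumptions made for illustration.

```python
def quantize(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] (assumed min-max scheme)."""
    lo, hi = min(values), max(values)
    levels = (1 << num_bits) - 1
    # Guard against a zero range when all values are equal.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


values = [0.0, 0.1, 0.25, 0.7, 1.0]
codes, lo, scale = quantize(values)
restored = dequantize(codes, lo, scale)
```

With this scheme, each reconstructed value differs from the original by at most half a quantization step (`scale / 2`), up to floating-point rounding.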

Use Cases and Benchmarks (WP 8 and 9): Work packages 8 and 9 held regular meetings to discuss the individual use cases, the use case descriptions, and the ML pipeline implementations. A major outcome is the set of use case pipelines in D8.1, which serve as example use cases for the DAPHNE system and as real-world benchmarks. We further surveyed existing benchmarks in databases, data-parallel computation, HPC, and ML systems in D9.1. Additionally, HPI made major contributions to the development of the TPCx-AI benchmark (released in 09/2021), and several partners (HPI, UNIBAS, KNOW) conducted student projects on benchmarking IDA pipelines and on additional TPCx-AI implementations.

During the first 18 months, we created the initial design and system architecture of the open and extensible DAPHNE system infrastructure based on MLIR as a multi-level intermediate representation. During this time, MLIR gained broader community adoption for different dialects, optimization passes, and hardware accelerators. However, most projects focus on narrow aspects rather than on an end-to-end system for IDA pipelines. In contrast, the DAPHNE prototype was made open source in March 2022, and we continue to build a full system infrastructure. This infrastructure lays the foundation for broader impact in the coming years. A major focus for the immediate future is extensibility, so that researchers and developers can quickly explore and experiment while reusing the infrastructure. Additional advancements include selected research projects by the individual partners. Many of these results will later be integrated back into the DAPHNE system and use cases. This balance of advanced development and foundational research aims to maximize impact, both scientifically and in practice.

DAPHNE System Architecture
