Periodic Reporting for period 1 - DAPHNE (Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning)
Reporting period: 2020-12-01 to 2022-05-31
Project Management / Dissemination (WP 1 and 10): Besides regular all-hands meetings, in the first 18 months we focused on setting up the project infrastructure. Related outcomes are the project and risk management plan (D1.1), the research data management plan (D1.2), and the first annual report (D1.3). Furthermore, we organized the kickoff meeting in 12/2020 and the first general assembly meeting in 10/2021, and are organizing the first review meeting in 07/2022. An initial website was replaced by a new website with more information, use cases, talks, and publications. Besides papers and talks, we conducted broad dissemination and exploitation activities and refined the dissemination and exploitation plan in D10.1.
System Architecture and DSL (WP 2 and 3): After many discussions, we summarized the requirements for an open and extensible system infrastructure and defined its system architecture and components. This system architecture was documented in D2.1 and in a joint CIDR 2022 paper by all partners. We defined the DAPHNE language abstractions, DaphneDSL (a domain-specific language) and DaphneAPI (a Python API), in D3.1. In this context, we also described the initial design of the MLIR-based optimizing compiler, DaphneIR as the central intermediate representation, and future extensions through higher-level built-ins. Since February 2021, we have been actively developing a prototype of the DAPHNE system, which was shared as a demonstrator in D3.2 and, as of March 31, 2022, has been migrated to a public OSS repository (https://github.com/daphne-eu/daphne) under the Apache 2.0 license.
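To illustrate the role of a central intermediate representation, the following toy Python sketch shows how a small DSL pipeline could be lowered into a linear, SSA-like sequence of high-level operations before optimization. All class and operation names here are hypothetical stand-ins; they are not the actual DaphneIR dialect or DaphneDSL syntax.

```python
# Toy linear IR with high-level ops, loosely mimicking how a DSL script
# could be represented as an SSA-style intermediate form.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str      # operation name, e.g. "readMatrix", "matMul" (hypothetical)
    inputs: list   # SSA value ids of the operands
    output: str    # SSA value id produced by this op

@dataclass
class IRModule:
    ops: list = field(default_factory=list)
    counter: int = 0

    def emit(self, name, *inputs):
        # Each emitted op defines exactly one new SSA value.
        self.counter += 1
        out = f"%{self.counter}"
        self.ops.append(Op(name, list(inputs), out))
        return out

# Build the IR for a small pipeline computing t(X) @ X.
m = IRModule()
x = m.emit("readMatrix")      # load input matrix
xt = m.emit("transpose", x)
z = m.emit("matMul", xt, x)   # result of t(X) @ X

for op in m.ops:
    print(f"{op.output} = {op.name}({', '.join(op.inputs)})")
```

A representation of this shape is what makes rewrites such as operator fusion or accelerator placement expressible as IR-to-IR transformations, which is the core idea behind an MLIR-based compiler stack.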
Runtime and Scheduling (WP 4 and 5): Discussions in WP 4 and 5 combined knowledge sharing of selected techniques with in-depth discussions of runtime aspects of the prototype and its extensions. Initial efforts centered on the core data structures and kernels. We introduced a vectorized (tiled) execution engine that processes operator pipelines in a task-based manner on tiles of the inputs. The design is described in the system architecture (D2.1), the language abstractions (D3.1), the DSL runtime design (D4.1), and the scheduler design (D5.1). Beyond the local runtime, we also created an initial distributed runtime system, which uses hierarchical vectorized pipelines. Additional work investigated distribution primitives, collective operations (e.g. via MPI), parameter servers, and distribution strategies. For hierarchical scheduling, we analyzed requirements and explored various task scheduling strategies.
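The idea of vectorized (tiled), task-based execution can be sketched in a few lines of plain Python: a fused operator pipeline is applied as one task per row tile, and the per-tile partial results are aggregated afterwards. This is a minimal illustration under simplifying assumptions (a thread pool, a single reduction), not DAPHNE's actual runtime code.

```python
# Minimal sketch of vectorized/tiled execution: an operator pipeline is
# applied task-wise to row tiles of the input, and the per-tile partial
# results are aggregated at the end.
from concurrent.futures import ThreadPoolExecutor

def pipeline(tile):
    # Fused operator pipeline on one tile: element-wise square, then sum.
    return sum(v * v for v in tile)

def vectorized_execute(data, tile_rows=4, workers=2):
    # Split the input into row tiles of size tile_rows.
    tiles = [data[i:i + tile_rows] for i in range(0, len(data), tile_rows)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(pipeline, tiles))  # one task per tile
    return sum(partials)  # final aggregation of the partial results

print(vectorized_execute(list(range(10))))  # sum of squares 0..9 = 285
```

The distributed runtime described above generalizes this pattern hierarchically: workers execute the same tiled pipelines locally, and their partial results are combined across nodes.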
Computational Storage and HW Accelerators (WP 6 and 7): Work packages 6 and 7 also have natural synergies. Besides knowledge sharing, initial work in the first 18 months covered basic I/O support for selected data formats, an analysis of the design space and current technology trends (D6.1), as well as an initial integration of GPU and FPGA operations, related data placement primitives, and tailor-made device kernels for selected operations (e.g. FPGA quantization). The integration of GPU and FPGA accelerators is important for the performance of various end-to-end pipelines and serves as an example for integrating other HW accelerators. GPUs (and later FPGAs) are also part of the vectorized execution engine to exploit heterogeneous HW jointly. More specialized work focused on virtual vector abstractions for SIMD, computational storage platforms and initial experiments, the exploration of abstractions for complex storage hierarchies, and performance models.
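The interplay of data placement primitives and device kernels can be illustrated with a small dispatch sketch: an operation is routed to a device-specific kernel based on where its input data is placed, with a CPU fallback when no specialized kernel exists. The registry layout, device names, and functions below are purely hypothetical and do not reflect DAPHNE's actual kernel interface.

```python
# Hypothetical sketch of device-aware kernel dispatch: an operation is
# routed to a CPU or GPU kernel depending on data placement, falling back
# to the CPU kernel for devices without a specialized implementation.
KERNELS = {}

def register(op, device):
    # Decorator registering a kernel for an (operation, device) pair.
    def wrap(fn):
        KERNELS[(op, device)] = fn
        return fn
    return wrap

@register("sum", "cpu")
def sum_cpu(values):
    return sum(values)

@register("sum", "gpu")
def sum_gpu(values):
    # Stand-in for a GPU kernel; a real system would launch device code.
    return sum(values)

def dispatch(op, values, placement="cpu"):
    # Prefer the kernel matching the data placement; else fall back to CPU.
    fn = KERNELS.get((op, placement)) or KERNELS[(op, "cpu")]
    return fn(values)

print(dispatch("sum", [1, 2, 3], placement="gpu"))
```

In a vectorized execution engine of the kind described above, such a dispatch step is what lets individual pipeline tasks run on heterogeneous devices jointly.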
Use Cases and Benchmarks (WP 8 and 9): Work packages 8 and 9 conducted regular meetings to discuss the individual use cases, the use case descriptions, and the ML pipeline implementations. A major outcome is the set of use case pipelines in D8.1, which serve as example use cases for the DAPHNE system and as real-world benchmarks. We further surveyed existing benchmarks in databases, data-parallel computation, HPC, and ML systems in D9.1. Additionally, HPI made major contributions to the development of the TPCx-AI benchmark (released in 09/2021), and several partners (HPI, UNIBAS, KNOW) conducted student projects on benchmarking IDA pipelines and additional TPCx-AI implementations.