European Extreme Performing Big Data Stacks

Periodic Reporting for period 2 - E2DATA (European Extreme Performing Big Data Stacks)

Reporting period: 2019-07-01 to 2020-12-31

Imagine a Big Data application with the following characteristics: (i) it has to process large amounts of complex streaming data, (ii) the application logic that processes the incoming data must execute and complete within a strict time limit (e.g. three seconds), and (iii) there is a limited budget for infrastructure resources. In today’s world, the data would be streamed from the local network or edge devices to a cloud provider rented by the customer to perform the execution. The Big Data software stack, in an application- and hardware-agnostic manner, splits the execution stream into multiple tasks and sends them for processing on the nodes the customer has paid for. If the outcome does not meet the strict three-second business requirement, the customer has three options: 1) scale up (by upgrading processors at node level), 2) scale out (by adding nodes to their cluster), or 3) manually implement code optimizations specific to the underlying hardware.
However, the customer does not have the financial capability to pursue any of these options. Ideally, they would like to meet their business requirements without stretching their hardware budget. The natural question that arises is the following:
“How can we improve execution times while using fewer hardware resources?”
E2Data proposes an end-to-end solution for Big Data deployments that will fully exploit and advance the state-of-the-art in infrastructure services by delivering a performance increase of up to 10x while utilizing up to 50% less cloud resources.
E2Data will provide a new Big Data software paradigm for achieving maximum resource utilization in heterogeneous cloud deployments without affecting current Big Data programming norms (i.e. no code changes in the original source). The proposed solution takes a cross-layer approach by allowing vertical communication between the four key layers of Big Data deployments (application, Big Data software, scheduler/cloud provider, and execution runtime), which allows the E2Data-enabled stack to address the following question:
“How can the user establish, for each particular business scenario, which hardware configuration is the highest performing and the cheapest?”
During the period covered by the report, the Heterogeneous-aware Big Data platform has been developed within T3.3 in WP3. Progress is described in subsection 1.2.3 of this report, and in more detail in deliverable D3.1 submitted during the first reporting period.
Furthermore, the Intelligent Scheduler architecture has been defined in T4.1 and prototypes for the Intelligence layer and the Automatic Decision-Making mechanism for elastic resource provisioning have been developed within T4.2 and T4.3 respectively. Progress related to this objective is described in subsection 1.2.4 and in more detail in deliverables D4.1 and D4.2 submitted during the first reporting period.
TornadoVM, the Heterogeneous Execution Engine of E2Data developed in T5.2, has been made available as open source and has already undergone three release cycles. Progress with regard to this objective is provided in subsection 1.2.5 and in more detail in deliverable D5.1.
The profiler tool has been developed within T5.1 and is undergoing code review for integration into TornadoVM. Progress related to this tool is provided in subsection 1.2.5.
Within T3.2 the consortium identified and verified the Cloud Resource Management Framework that will be used in E2Data. Progress is provided in subsection 1.2.3 and in more detail in deliverable D3.1.
An initial design of the User Interface components of the Big Data Visualisation Tool has been produced within T3.4. A description of the progress is provided in subsection 1.2.3.
Apart from the development-related progress identified above, significant progress has also been achieved within WP2 on the front of user requirements, described in subsection 1.2.2 and provided in more detail in deliverable D2.1. Within the same WP, progress has been made towards developing the accelerated versions of the use cases, based on the current version of the E2Data heterogeneous execution engine, i.e. TornadoVM. This progress is reflected in deliverable D2.2.
With regard to the Integration and Evaluation of the E2Data Architecture, the architecture has been defined, the initial prototype of the E2Data stack has been developed, and both project testbeds, comprising high-performing x86 and low-power ARM cluster architectures, have been deployed and are ready to use. Progress within the corresponding WP6 is described in subsection 1.2.6 and in deliverables D6.1 and D6.2 submitted during the first reporting period.
The E2Data project has advanced the state-of-the-art of Big Data and Java execution on heterogeneous hardware in the following aspects:
1. Propose a novel dynamic compilation framework that allows our runtime system to adapt execution to the widest range of heterogeneous systems,
2. Develop a runtime system and API that allow developers to easily compose complex multi-kernel codes (see the sketch after this list),
3. Target a variety of parallel frameworks that build on top of Apache Flink (such as Storm and MapReduce),
4. Target a variety of hardware platforms, including x86 and ARM AArch64 along with their accompanying GPU and FPGA units, and
5. Perform compilation and offloading on heterogeneous resources transparently to the user, without source code modifications.
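As an illustration of points 2 and 5, the sketch below shows the style of multi-kernel composition supported by TornadoVM's task-based API. It is a minimal example rather than project code: the Pipeline class and its kernel methods are invented for illustration, and the API calls (TaskSchedule, @Parallel, streamIn, task, streamOut, execute) correspond to the TornadoVM releases available during this reporting period, so the current API may differ.

    // Minimal sketch (not project code): composing two kernels with TornadoVM's
    // task-based API as shipped in the releases of this reporting period.
    import uk.ac.manchester.tornado.api.TaskSchedule;
    import uk.ac.manchester.tornado.api.annotations.Parallel;

    public class Pipeline {

        // Plain Java kernels; @Parallel marks the loops TornadoVM may map to
        // GPU/FPGA threads when it JIT-compiles and offloads the task.
        public static void vectorAdd(float[] a, float[] b, float[] c) {
            for (@Parallel int i = 0; i < c.length; i++) {
                c[i] = a[i] + b[i];
            }
        }

        public static void scale(float[] in, float[] out, float alpha) {
            for (@Parallel int i = 0; i < out.length; i++) {
                out[i] = alpha * in[i];
            }
        }

        public static void main(String[] args) {
            final int n = 1024;
            float[] a = new float[n], b = new float[n], c = new float[n], d = new float[n];
            java.util.Arrays.fill(a, 1.0f);
            java.util.Arrays.fill(b, 2.0f);

            // Two kernels composed into a single schedule; compilation and
            // offloading to the selected accelerator happen transparently,
            // without changes to the kernel methods themselves.
            new TaskSchedule("s0")
                .streamIn(a, b)
                .task("t0", Pipeline::vectorAdd, a, b, c)
                .task("t1", Pipeline::scale, c, d, 0.5f)
                .streamOut(d)
                .execute();
        }
    }

Because the kernels remain ordinary Java methods, the same code can also run unmodified on the JVM when no accelerator is available.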

The E2Data project has advanced the state-of-the-art of scheduling and resource provisioning in the following aspects:
1. Implement an adaptive scheduling technique according to the user’s and application’s requirements, where the E2Data hardware-aware, elastic scheduler selects the appropriate hardware and architecture to execute the respective tasks.
2. Design hardware-aware elastic algorithms that adaptively adjust resources to meet the requirements posed by the application developers/users, while automatically identifying the application or infrastructural parameters that are important for the application’s performance, thus tackling the “curse of dimensionality” problem.
3. Employ elastic scheduling techniques by extending the OpenStack nova-scheduler and Apache Mesos with additional filters to effectively handle heterogeneous resources and dynamically provide detailed information regarding the underlying hardware characteristics of the target platforms (a simplified filtering sketch follows this list).
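To make the filtering idea in point 3 concrete, here is a short, purely hypothetical sketch, written in Java for consistency with the previous example. It is not the OpenStack or Mesos API (real nova-scheduler filters are Python classes implementing host_passes(), and Mesos is offer-based); all class and field names below are invented solely to illustrate how a hardware-aware filter narrows the candidate hosts for a task.

    // Hypothetical illustration only: a hardware-aware filter keeps the hosts
    // matching a task's accelerator and memory requirements.
    import java.util.List;
    import java.util.stream.Collectors;

    public class HardwareAwareFilter {

        public enum Accelerator { NONE, GPU, FPGA }

        public record Host(String name, Accelerator accelerator, int freeMemGb) {}
        public record TaskRequirements(Accelerator accelerator, int minMemGb) {}

        // Keep only the hosts that satisfy the task's hardware requirements.
        public static List<Host> filter(List<Host> hosts, TaskRequirements req) {
            return hosts.stream()
                    .filter(h -> req.accelerator() == Accelerator.NONE
                              || h.accelerator() == req.accelerator())
                    .filter(h -> h.freeMemGb() >= req.minMemGb())
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Host> hosts = List.of(
                    new Host("x86-gpu-node", Accelerator.GPU, 64),
                    new Host("arm-fpga-node", Accelerator.FPGA, 16),
                    new Host("cpu-only-node", Accelerator.NONE, 128));

            // A task that needs a GPU and at least 32 GB of free memory.
            for (Host h : filter(hosts, new TaskRequirements(Accelerator.GPU, 32))) {
                System.out.println("Candidate host: " + h.name());
            }
        }
    }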

The dual impact on both Big Data practitioners and service providers/IaaS vendors is:
(a) Big Data practitioners will enjoy performance gains through both competitive pricing (by utilizing cheaper-to-operate and more efficient hardware) and optimal resource allocation (by automatically and elastically utilizing the correct amount and type of hardware resources and software configurations);
(b) Big Data service providers/IaaS vendors will be able to boost the adoption and increase utilization of the newly offered hardware, since it will be easily integrated and operated in cloud software services offered by Big Data practitioners, without requiring hard-to-find, expensive, hardware-specific programming skills.