Periodic Reporting for period 2 - E2DATA (European Extreme Performing Big Data Stacks)
Reporting period: 2019-07-01 to 2020-12-31
However, the customer does not have the financial capability to achieve that. Ideally, they would like to achieve their business requirements without stretching their hardware budget. The natural question that arises is the following:
“How can we improve execution times while using fewer hardware resources?”
E2Data proposes an end-to-end solution for Big Data deployments that will fully exploit and advance the state-of-the-art in infrastructure services by delivering a performance increase of up to 10x while utilizing up to 50% fewer cloud resources.
E2Data will provide a new Big Data software paradigm for achieving maximum resource utilization in heterogeneous cloud deployments without affecting current Big Data programming norms (i.e. no code changes in the original source). The proposed solution takes a cross-layer approach, allowing vertical communication between the four key layers of Big Data deployments (application, Big Data software, scheduler/cloud provider, and execution runtime), which will allow the E2Data-enabled stack to address the following question:
“How can the user establish for each particular business scenario which is the highest performing and cheapest hardware configuration?”
Furthermore, the Intelligent Scheduler architecture has been defined in T4.1 and prototypes for the Intelligence layer and the Automatic Decision-Making mechanism for elastic resource provisioning have been developed within T4.2 and T4.3 respectively. Progress related to this objective is described in subsection 1.2.4 and in more detail in deliverables D4.1 and D4.2 submitted during the first reporting period.
TornadoVM, the Heterogeneous Execution Engine of E2Data developed in T5.2, has been made available as open source and has already undergone three release cycles. Progress with regard to this objective is provided in subsection 1.2.5 and in more detail in deliverable D5.1.
The profiler tool has been developed within T5.1 and is undergoing code review so that it can be integrated into TornadoVM. Progress related to this tool is provided in subsection 1.2.5.
Within T3.2 the consortium identified and verified the Cloud Resource Management Framework that will be used in E2Data. Progress is provided in subsection 1.2.3 and in more detail in deliverable D3.1.
An initial design of the User Interface components of the Big Data Visualisation Tool has been produced within T3.4. A description of the progress is provided in subsection 1.2.3.
Apart from the development-related progress identified in the table above, significant progress has also been achieved within WP2 on the user requirements, described in subsection 1.2.2 and provided in more detail in deliverable D2.1. Within the same WP, progress has been made towards developing the accelerated version of the use cases, based on the current version of the E2Data Heterogeneous Execution Engine, i.e. TornadoVM. This progress is reflected in deliverable D2.2.
With regard to the Integration and Evaluation of the E2Data Architecture, the architecture has been defined, the initial prototype of the E2Data stack has been developed, and both project testbeds, comprising high-performing x86 and low-power ARM cluster architectures, have been deployed and are ready to use. Progress within the corresponding WP6 is described in subsection 1.2.6 and in deliverables D6.1 and D6.2 submitted during the first reporting period.
1. Propose a novel dynamic compilation framework that will allow our runtime system to adapt execution for the widest range of heterogeneous systems,
2. Develop a runtime system and API that allows developers to easily compose complex multi-kernel codes,
3. Target a variety of parallel frameworks that build on top of Apache Flink (such as Storm and MapReduce),
4. Target a variety of hardware platforms including x86 and ARM AArch64 along with their accompanying GPU and FPGA units, and
5. Perform compilation and offloading on heterogeneous resources transparently to the user without source code modifications.
The E2Data project has advanced the state-of-the-art of scheduling and resource provisioning in the following aspects:
1. Implement an adaptive scheduling technique according to the user’s and application’s requirements where the E2Data hardware-aware, elastic scheduler will select the appropriate hardware and architecture to execute the respective tasks.
2. Design hardware-aware elastic algorithms to adaptively adjust resources in order to meet the requirements posed by the application developers/users while automatically identifying the application or infrastructural parameters that are important for the application’s performance thus tackling the “curse of dimensionality” problem.
3. Employ elastic scheduling techniques by extending the OpenStack nova-scheduler and Apache Mesos with additional filters to effectively handle heterogeneous resources and dynamically provide detailed information regarding the underlying hardware characteristics of the target platforms.
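The filter-based extension described in item 3 can be illustrated with a minimal sketch of the general "filter, then weigh" scheduling pattern. This is not the OpenStack nova-scheduler or Apache Mesos API, and it is not the E2Data scheduler itself: the `Host` and `TaskRequest` types, the two filter functions, and the cost-based ranking are all hypothetical names and choices introduced here purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Host:
    """Hypothetical description of one node in a heterogeneous cluster."""
    name: str
    arch: str                                 # e.g. "x86_64" or "aarch64"
    accelerators: frozenset = frozenset()     # e.g. frozenset({"gpu", "fpga"})
    free_cores: int = 0
    cost_per_hour: float = 0.0


@dataclass(frozen=True)
class TaskRequest:
    """Hypothetical resource request attached to a task."""
    cores: int
    accelerator: Optional[str] = None         # required device type, if any


def accelerator_filter(host: Host, req: TaskRequest) -> bool:
    # Pass hosts that expose the requested accelerator (or any host if none is requested).
    return req.accelerator is None or req.accelerator in host.accelerators


def core_filter(host: Host, req: TaskRequest) -> bool:
    # Pass hosts with enough free CPU cores for the task.
    return host.free_cores >= req.cores


# Assumed filter chain; a real scheduler would load these from configuration.
FILTERS = (accelerator_filter, core_filter)


def select_host(hosts, req: TaskRequest) -> Optional[Host]:
    """Return the cheapest host that passes every filter, or None if none does."""
    candidates = [h for h in hosts if all(f(h, req) for f in FILTERS)]
    if not candidates:
        return None
    # Weighing step: prefer the lowest hourly cost; break ties by name for determinism.
    return min(candidates, key=lambda h: (h.cost_per_hour, h.name))
```

In this sketch, adding support for a new kind of hardware amounts to adding one more predicate to the filter chain, which is the property that makes the filter approach attractive for heterogeneous resources. The cost-only ranking is a simplification; an elastic scheduler would presumably also weigh predicted execution time.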
The dual impact on both Big Data practitioners and service providers/IaaS vendors is:
(a) Big Data practitioners will enjoy gains through both competitive pricing (by utilizing cheaper-to-operate and more efficient hardware) and optimal resource allocation (by automatically and elastically utilizing the correct amount and type of hardware resources and software configurations).
(b) Big Data service providers/IaaS vendors will be able to boost the adoption and increase utilization of the newly offered hardware, since it will be easily integrated and operated in cloud software services offered by Big Data practitioners, without requiring hard-to-find, expensive, hardware-specific programming skills.