
Energy-efficient Heterogeneous COmputing at exaSCALE

Periodic Reporting for period 2 - ECOSCALE (Energy-efficient Heterogeneous COmputing at exaSCALE)

Reporting period: 2017-04-01 to 2019-07-31

As HPC architectures have evolved, the HPC market has undergone a paradigm shift. The adoption of low-cost, Linux-based clusters offers significant computing performance and the ability to run a wide array of applications. These new classes of HPC applications are becoming increasingly performance- and power-hungry, pushing HPC systems to their limits. At the same time, existing HPC systems cannot provide exascale performance because of their power limitations.

ECOSCALE tackles this challenge by proposing a scalable programming environment and hardware architecture tailored to the characteristics and trends of current and future HPC applications, significantly reducing data traffic as well as energy consumption and delays. ECOSCALE introduces a novel heterogeneous energy-efficient architecture, programming model and runtime system that follow a hierarchical approach, in which the system is partitioned into multiple autonomous Workers (i.e. compute nodes). Workers are interconnected in a tree-like structure to form larger Partitioned Global Address Space (PGAS) partitions, which can in turn be hierarchically interconnected via an administrative and message-passing protocol.

ECOSCALE resulted in the manufacturing of a 256-core heterogeneous prototype that effectively forms a small-scale HPC system of interconnected nodes. Specifically, the UNILOGIC architecture has been evaluated on a custom prototype consisting of two 1U chassis. Each chassis includes eight interconnected daughter boards, called Quad-FPGA Daughter Boards (QFDBs), and each QFDB supports four tightly coupled Xilinx Zynq UltraScale+ MPSoCs as well as 64 gigabytes of DDR4 memory. Thus, the prototype features 64 Zynq MPSoCs and 1 terabyte of memory in total. The ECOSCALE framework and platform have been evaluated on this prototype using low-level bare-metal benchmarks as well as the ECOSCALE use cases, which represent two popular real-world HPC applications, one compute-intensive and one data-intensive: oil-reservoir simulation and traffic video-data (smart-city) analysis. Based on the measured results for these two use cases, ECOSCALE has achieved notable improvements, running 2.5 to 400 times faster and being 46 to 300 times more energy efficient than conventional parallel systems utilizing only high-end CPUs. Moreover, the evaluation of the two real-world applications shows that UNILOGIC scales almost linearly while allowing all the reconfigurable resources in the parallel system to be utilized as if they were in a single large device. The performance overhead of supporting remote accesses to reconfigurable resources is less than 4%.

Notably, the ECOSCALE architecture supports heterogeneous systems, incorporating CPUs and reconfigurable logic in an efficient manner. It is also scalable enough to support thousands of nodes, while offering virtualized reconfigurable resources as well as unified, system-wide memory accesses. To increase programmability, the architecture offers a unified environment in which all the reconfigurable resources can be seamlessly used by any processor or operating system. This means that, with ECOSCALE, application developers do not have to be aware of where the hardware accelerators or their data are placed in the parallel system.

Finally, our evaluation demonstrates that the energy efficiency offered by the prototype is comparable to that of state-of-the-art HPC systems built on newer transistor technologies. In particular, it offers 9 to 17 GFLOPS per watt using the Xilinx 16nm UltraScale+ devices.
The work carried out in the project is described in detail in the two Periodic Technical Reports (PTRs).

Work of a technical nature took place in the context of Work Package 2 (WP2). Tasks 2.1 and 2.2 of WP2 were completed within the first four and six months of the project, respectively. The other two tasks, i) T2.3 Development of Reservoir Simulation Application and ii) T2.4 Development of Smart-City Application, were active in the 2nd period of the project and were completed by month 36.

Work Package 3 (WP3) is, like WP2, a technical work package. In this case, three of its four tasks (3.1 to 3.3) were completed within the first half of the project. The only task that continued into the 2nd period, T3.4 Co-design Inspection, was completed by month 36.
The technical work branches out to the tasks of WP4, WP5 and WP6.

Work Package 4 also has several tasks that started in the first half of the project and continued into its later stages. Tasks i) T4.1 Procurement of HW Components and ii) T4.2 Prototype Platform were extended in order to provide an initial small prototype for SW and HW development; the final prototype was completed towards the end of the project. Tasks iii) T4.3 UNIMEM Design, iv) T4.4 UNILOGIC Design, v) T4.5 Central Router Design, vi) T4.6 Prototype Integration and Testing and vii) T4.7 Modelling, Simulation and Architectural Parameter Optimization commenced within the first half of the project and were completed by month 39.

As mentioned above, WP5 also contains work of a technical nature. Four of the seven WP5 tasks, specifically tasks 5.1 to 5.4, were finished within the first half of the project. The remaining tasks, namely i) T5.5 Runtime FPGA Resources Management and Module Placement and ii) T5.6 Reconfigurable Resources Monitoring, were initiated within the first half of the project and the work carried on until month 34. Task iii) T5.7 Integrated Reconfigurable Computing Tool Flow was initiated and completed within the second half of the project.

Work Package 6 is also a technical WP; of its six tasks, the first two (T6.1 and T6.2) were completed within the first half of the project. Two tasks, i) T6.3 Hardware-Software Partitioning Models and Algorithms and ii) T6.4 OpenCL Hardware-Software Partitioning Implementation, were initiated in the first half, whereas tasks iii) T6.5 Coordinated Hardware-Software Resilience in OpenCL and iv) T6.6 Runtime System Integration and Coordinated Resilience in MPI/OpenCL were initiated in the second half of the project. These four tasks were completed by month 38.
ECOSCALE does not simply provide another offloading engine; it goes beyond the state of the art by proposing and developing a Global Distributed Reconfigurable Logic, where reconfigurable resources are transparently shared between applications. Moreover, ECOSCALE provides for the first time a heterogeneous infrastructure, runtime system and tool flow which can automatically load HW tasks anywhere in the reconfigurable resources of the ECOSCALE system. This functionality is offered to the programmer through a user-friendly programming model which extends OpenCL in order to access multiple distributed reconfigurable resources.

The target of ECOSCALE is to provide the technological advances needed to derive the first energy-efficient exascale system, unifying and extending existing architectures and programming environments and combining them with novel reconfigurable systems. The expected potential impact of ECOSCALE is described in the DoA. No update is needed.
ECOSCALE Worker Architecture
ECOSCALE reconfiguration
ECOSCALE Gantt Chart