As HPC architectures have evolved, the HPC market has undergone a paradigm shift. The adoption of low-cost, Linux-based clusters offers significant computing performance and the ability to run a wide array of applications. These new classes of HPC applications are becoming increasingly performance- and power-hungry, pushing HPC systems to their limits. At the same time, existing HPC systems cannot provide exascale performance because of their power limitations.
ECOSCALE tackles this challenge by proposing a scalable programming environment and hardware architecture tailored to the characteristics and trends of current and future HPC applications, significantly reducing data traffic, energy consumption and delays. ECOSCALE introduces a novel heterogeneous, energy-efficient architecture, programming model and runtime system that follow a hierarchical approach in which the system is partitioned into multiple autonomous Workers (i.e. compute nodes). Workers are interconnected in a tree-like structure to form larger Partitioned Global Address Space (PGAS) partitions, which can in turn be hierarchically interconnected via an administrative and message-passing protocol.
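The two-level hierarchy described above can be pictured with a small sketch. The Python model below is purely illustrative: the names `Worker` and `access_path`, the flat two-level layout and the three path labels are assumptions made for this example and are not part of the ECOSCALE programming model or runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    """Hypothetical handle for one autonomous Worker (compute node)."""
    partition_id: int  # which PGAS partition the Worker belongs to
    worker_id: int     # index of the Worker inside that partition


def access_path(src: Worker, dst: Worker) -> str:
    """Classify how src would reach data or resources owned by dst.

    Assumed rule for this sketch:
      - same Worker                -> plain local access
      - same PGAS partition        -> shared global address space access
      - different PGAS partitions  -> message passing between partitions
    """
    if src == dst:
        return "local"
    if src.partition_id == dst.partition_id:
        return "pgas"
    return "message-passing"


if __name__ == "__main__":
    a = Worker(partition_id=0, worker_id=0)
    b = Worker(partition_id=0, worker_id=3)
    c = Worker(partition_id=1, worker_id=0)
    for dst in (a, b, c):
        print(a, "->", dst, ":", access_path(a, dst))
```

The point of the sketch is only that the choice of communication mechanism follows from where two Workers sit in the hierarchy; a real system would also have to account for deeper nesting of partitions.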
ECOSCALE resulted in the manufacturing of a 256-core heterogeneous prototype that effectively serves as a small-scale HPC system of several interconnected nodes. Specifically, the UNILOGIC architecture has been evaluated on a custom prototype that consists of two 1U chassis. Each chassis includes eight interconnected daughter boards, called Quad-FPGA Daughter Boards (QFDBs), and each QFDB supports four tightly coupled Xilinx Zynq UltraScale+ MPSoCs as well as 64 gigabytes of DDR4 memory. Thus, the prototype features 64 Zynq MPSoCs and 1 terabyte of memory in total. The ECOSCALE framework and platform have been evaluated on this prototype using low-level bare-metal benchmarks as well as the ECOSCALE use cases, which represent two popular real-world HPC applications: one compute-intensive (oil-reservoir simulation) and one data-intensive (traffic video-data analysis for smart cities). Based on the measured results for these two use cases, ECOSCALE is 2.5 to 400 times faster and 46 to 300 times more energy efficient than conventional parallel systems that use only high-end CPUs. Moreover, based on the evaluation of the two real-world applications, UNILOGIC scales almost linearly while allowing all the reconfigurable resources in the parallel system to be used as if they were in a single large device. The performance overhead of supporting remote accesses to reconfigurable resources is less than 4%.
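The prototype totals quoted above follow from simple multiplication, which the short Python sketch below reproduces. The chassis, QFDB, MPSoC and memory counts are taken from the text; the figure of four CPU cores per MPSoC is an assumption made here so that the total agrees with the 256-core description, since the text does not state it. All constant and function names are illustrative.

```python
# Prototype composition as stated in the text.
CHASSIS = 2
QFDBS_PER_CHASSIS = 8
MPSOCS_PER_QFDB = 4
DDR4_GB_PER_QFDB = 64

# Assumption (not stated in the text): four CPU cores per MPSoC.
CORES_PER_MPSOC = 4


def prototype_totals() -> dict:
    """Derive the system-wide totals from the per-board figures."""
    qfdbs = CHASSIS * QFDBS_PER_CHASSIS
    mpsocs = qfdbs * MPSOCS_PER_QFDB
    return {
        "qfdbs": qfdbs,
        "mpsocs": mpsocs,
        "memory_gb": qfdbs * DDR4_GB_PER_QFDB,
        "cpu_cores": mpsocs * CORES_PER_MPSOC,
    }


if __name__ == "__main__":
    for name, value in prototype_totals().items():
        print(f"{name}: {value}")
```

With these inputs the sketch yields 16 QFDBs, 64 MPSoCs, 1024 GB (1 terabyte) of DDR4 and, under the stated assumption, 256 CPU cores.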
Notably, the ECOSCALE architecture supports heterogeneous systems, efficiently incorporating CPUs and reconfigurable logic. It is also scalable enough to support thousands of nodes, while offering virtualized reconfigurable resources as well as unified, system-wide memory accesses. To increase programmability, the architecture offers a unified environment in which all the reconfigurable resources can be seamlessly used by any processor or operating system. This means that, with ECOSCALE, application developers do not need to know where the hardware accelerators or their data items are placed in the parallel system.
Finally, our evaluation demonstrates that the energy efficiency offered by the prototype is comparable to that of state-of-the-art HPC systems built on newer transistor technologies. In particular, it offers 9 to 17 GFLOPS per watt using the Xilinx 16nm UltraScale+ devices.