MANGO: exploring Manycore Architectures for Next-GeneratiOn HPC systems

Periodic Reporting for period 1 - MANGO (MANGO: exploring Manycore Architectures for Next-GeneratiOn HPC systems)

Reporting period: 2015-10-01 to 2017-03-31

HPC is becoming the enabler of our IT-based society. To advance HPC, we need to approach the Intrinsic Computational Efficiency (ICE), which is the energy consumption per operation of computational units (e.g. an FPU). Heterogeneous systems make such efficiency attainable through power-efficient designs in which different types of devices (CPUs, GPUs, custom accelerators, FPGAs) are mixed and each is used only for the cases where it is efficient. As these devices have different power/performance trade-offs, there is a parallel effort to match resources to running applications.

MANGO explores new architectures for future HPC systems, taking heterogeneity as the central point. MANGO tries to answer the question: how do we combine heterogeneous components, and how do we program and manage them to achieve the best computational efficiency?

In addition, there are emerging HPC requirements: 1) guaranteeing predictability for new applications (as HPC merges with Big Data), and 2) providing capacity computing (running as many unrelated applications as possible). MANGO targets these two new requirements, exploring the so-called PPP design space: Performance, Power consumption and Predictability.

MANGO is building a prototype that will enable rapid architecture exploration. Building this prototype while guaranteeing its flexibility is in itself a major challenge.

MANGO aims for the following short-term goals:

- Develop a flexible prototype for architecture exploration
- Explore new heterogeneous architectures
- Provide real-time support in the PPP design space
- Provide unified and simple access via a smart interconnect
- Adapt programming models and compilers to new architectures
- Develop a proper resource manager
- Develop new monitoring tools
- Develop new cooling techniques
- Analyse impact on a set of real applications (video transcoding, medical imaging, security and surveillance)

The Consortium has mainly focused on the design and integration of the components that will make up the complete system. Activities fall roughly into three groups: definition and Phase 1 platform delivery (M1-M8), component design (M5-M14), and system integration (M12-M18).

During M1-M8 the Consortium focused on defining the specifications for applications, hardware and software. In parallel, the Phase 1 platform was delivered, allowing the first components to be deployed.

Partly overlapping with this (M5-M14), partners implemented basic SW/HW components such as accelerators, network, compiler support and resource manager. Finally, during the last months (M12-M18), a large effort was put into producing a fully integrated solution. The system now runs multiple applications, which trigger kernels on heterogeneous components via the resource manager. It enables prototype deployment, the development of customized applications, and the optimization of both the system and the applications, which are scheduled for the second half of the project.

Results:

- Partners furnished with 16 sets of emulation systems for rapid prototyping
- Fully configurable manycore accelerator
- LLVM Compiler support for the manycore
- Interconnect infrastructure for the heterogeneous system
- Heterogeneous infrastructure enabling rapid design exploration
- Common interface enabling accelerators to coexist
- Virtual addressing scheme for accelerators for unified memory access
- Baseline GPU-like processor
- Initial LLVM compiler extensions for GPU-like core
- Several versions of initial system baseline application testbed
- First HW accelerator tested on GPU-like processor
- Initial version of bare medical rendering algorithm
- Numerical model and working design of thermosyphon cooling system
- Thermal modelling framework and GUI based on 3D-ICE for the x86 i7 server
- Simulation framework for performance, power and temperature assessment
- Performance-, power- and temperature-aware resource management policies for HEVC transcoding
- Initial port of Barbeque Runtime Resource Management to ARM-based platform
- Initial port to Linux of LDPC algorithmic reference code

Current achievements promise significant impact during the second term, with a fully configurable and flexible prototype for HPC exploration. The target is to use the system achieved so far first to optimize the design for performance and predictability, and then to explore architectures.

Specific progress:

- The manycore accelerator is an important achievement: a configurable and flexible accelerator for manycore exploration. Alternative designs (e.g. OpenRISC) are not flexible enough to support custom coherence protocols and advanced networks
- The Nu+ GPU-like processor allows a large degree of freedom for exploring advanced architectures and the trade-offs between resource utilization and complexity. For computation, it provides an effective solution as an FPGA overlay, offering a way to build tailored processing elements that reach higher levels of resource efficiency through customization
- The initial HW numeric accelerator aims to provide a methodology for demonstrating the design of a computationally efficient program-less core. It addresses the discrete cosine transform (DCT), one of the most computationally demanding operations in transcoding
- The network for the heterogeneous system enables intra- and inter-FPGA communication on the same or different motherboards, as well as access to blade systems. It supports design-time instantiation of virtual networks and runtime allocation of bandwidth quotas for QoS
- The platform for heterogeneity support enables accelerators, memories and interconnect to be easily instantiated and implemented on a multi-FPGA system. This platform is unique in that it handles both the physical constraints set by the target hardware and the logical constraints set by the target HN system. UPV will use it beyond MANGO as a fast multi-FPGA tool
- Virtual address translation enables the resource manager to customize the memory address space of each accelerator, enabling full memory configuration and management
- The initial HW application accelerators on PEAK/NU+ enable exploration of architectural and application integration on any combination of general-purpose and heterogeneous cores
- The thermosyphon cooling system and the 3D-ICE thermal modelling framework advance over air-based solutions for thermal control. The two-phase cooling principle enables the removal of large heat fluxes, ensuring chip reliability and reducing thermal gradients. Its gravity-driven principle reduces cooling costs. The reduction of power consumption in cooling allows for increased power densities
- The 3D-ICE based thermal modelling framework, with numerical simulation of the thermosyphon, together with the Gem5-based power/performance framework, enables the proposal of proactive thermal/power management schemes
- Proactive techniques, coordinated with the resource manager and applied to customized boards, provide a much more balanced temperature than the latest reactive thermal solutions, bringing improved resilience and fault-free behavior
- Runtime management allows fine-grained management of heterogeneous resources, which is typically not available in HPC, and enables effective capacity computing
- Applications in the domains of networking infrastructures and satellite telecommunications, optimizing computation capabilities under real-time and power-efficiency constraints
- Motherboard development addressing cooling, thermal and power aspects, together with a specific rack-mount cage. Provision of newer FPGA technologies: Kintex UltraScale, Stratix 10, Zynq UltraScale+ and Zynq-7000
Basic FPGA motherboard used in MANGO platform
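
The virtual address translation item in the progress list can be illustrated with a small software model. This is only a sketch under stated assumptions, not the MANGO hardware or resource manager interface: the class `AcceleratorAddressSpace`, its methods, the 4 KiB page size and all addresses are hypothetical and chosen for illustration. It shows the general idea of giving each accelerator its own mapping from virtual to physical memory.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size in bytes; the report does not specify one


@dataclass
class AcceleratorAddressSpace:
    """Hypothetical per-accelerator table: virtual page number -> physical page number."""
    name: str
    page_table: dict = field(default_factory=dict)

    def map_region(self, virt_base, phys_base, size):
        """Map `size` bytes starting at virt_base onto physical memory at phys_base."""
        if virt_base % PAGE_SIZE or phys_base % PAGE_SIZE:
            raise ValueError("region bases must be page-aligned")
        pages = (size + PAGE_SIZE - 1) // PAGE_SIZE  # round up to whole pages
        for i in range(pages):
            self.page_table[virt_base // PAGE_SIZE + i] = phys_base // PAGE_SIZE + i

    def translate(self, virt_addr):
        """Return the physical address for virt_addr, or raise if it is unmapped."""
        vpn, offset = divmod(virt_addr, PAGE_SIZE)
        if vpn not in self.page_table:
            raise LookupError(f"{self.name}: unmapped virtual address {virt_addr:#x}")
        return self.page_table[vpn] * PAGE_SIZE + offset


# Two accelerators use the same virtual base but are backed by different
# physical memory (all addresses here are made up for the example).
acc_a = AcceleratorAddressSpace("accelerator_a")
acc_b = AcceleratorAddressSpace("accelerator_b")
acc_a.map_region(virt_base=0x0, phys_base=0x4000_0000, size=2 * PAGE_SIZE)
acc_b.map_region(virt_base=0x0, phys_base=0x8000_0000, size=PAGE_SIZE)

print(hex(acc_a.translate(0x1010)))
print(hex(acc_b.translate(0x0010)))
```

In this model, a resource manager would reconfigure an accelerator's view of memory simply by rewriting its table, without the kernel running on the accelerator changing the addresses it uses.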