Periodic Reporting for period 3 - OPRECOMP (Open transPREcision COMPuting)
Reporting period: 2019-07-01 to 2020-12-31
OPRECOMP aims to demolish the ultra-conservative “precise” computing abstraction and replace it with a more flexible and efficient one, namely Transprecision Computing. The approach is rooted in the key intuition of exploiting approximation in both hardware and software to boost energy efficiency. While this is clearly tied to the rapidly developing research area known as approximate computing, OPRECOMP aims to push beyond the state of the art along several axes. First, it aims to control approximation in space and time (when and where) at a fine grain through multiple hardware and software feedback control loops. Second, OPRECOMP aims to demonstrate that approximation (even a well-controlled one) during computation does not imply reduced precision at the application level, even though it is also possible to exploit application-level softening of precision requirements for extra benefits. Third, OPRECOMP takes inspiration from nature by defining computing architectures that operate over a smooth and wide range of the precision vs. cost trade-off curve.
OPRECOMP's goal is to demonstrate the breakthrough potential of the transprecision computing approach in two real-life computing scenarios with major market relevance. On the one hand, OPRECOMP targets at least one order of magnitude improvement in energy efficiency for a computing system working within a power budget of a few mW (e.g. an IoT end-device for near-sensor processing) and based on ETHZ's open-source PULP architecture. On the other hand, OPRECOMP targets the same level of energy efficiency boost for a computing system at the kW scale, consisting of an IBM POWER node based on IBM’s OpenPOWER technology.
1) Transprecision-boosted applications.
OPRECOMP identified twelve micro-benchmarks in three major areas of computing. For these micro-benchmarks, OPRECOMP has developed reference baseline scalar, parallel, and GPU implementations. Time, power, and energy costs have been measured for different workloads (e.g. varying input size) to characterize current state-of-the-art systems. A first set of transprecision algorithms that accelerate the micro-benchmarks and reduce their energy cost has been developed. In particular, OPRECOMP has so far demonstrated considerable acceleration in PageRank, BLSTM, CG, SpMV, SVD and others.
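A well-known pattern for solver-type kernels such as CG is mixed-precision iterative refinement: the expensive inner solve runs in low precision, while the residual is evaluated in high precision, so the final answer can still reach full accuracy. The sketch below illustrates the idea only and is not taken from the project's code; the function names, the Jacobi inner solver, and the use of binary32 storage as the "low" precision are all assumptions made for the example.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(A, b, sweeps=20):
    """Approximate A x = b with Jacobi sweeps; every iterate is stored in binary32.

    Hypothetical stand-in for a cheap, low-precision inner solver.
    Assumes A is diagonally dominant so that Jacobi converges.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        x_new = []
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new.append(to_f32((b[i] - s) / A[i][i]))
        x = x_new
    return x


def residual(A, x, b):
    """Residual b - A x, evaluated in binary64."""
    n = len(b)
    return [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]


def max_abs_residual(A, x, b):
    return max(abs(r) for r in residual(A, x, b))


def refine(A, b, outer=5):
    """Iterative refinement: low-precision corrections, high-precision residuals."""
    x = [0.0] * len(b)
    for _ in range(outer):
        r = residual(A, x, b)            # high precision
        d = solve_low_precision(A, r)    # low precision
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

On a small diagonally dominant system, `refine` should drive the residual far below what a single call to `solve_low_precision` reaches, which is the sense in which approximate intermediate steps need not reduce accuracy at the application level.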
2) Establishment of a full transprecision framework for computing.
OPRECOMP has laid the groundwork for the theoretical and experimental (quality metrics) analysis of the effects of transprecision on the micro-benchmarks identified in the project. Tools that emulate the effects of transprecision (accuracy and error bounds) through an intuitive software framework have been developed. To design and build a transprecision software stack, OPRECOMP has also explored application characteristics (including an automated precision tuning tool), a programming model, and an initial version of a transprecision compiler.
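The idea of emulating reduced precision in software and then tuning it automatically can be sketched in a few lines. The example below is a simplified illustration, not the project's tooling: `round_to_bits`, `dot_emulated`, and `tune_precision` are hypothetical names, only the significand width is emulated (the exponent range is left unlimited), and the tuner simply returns the first width that meets the tolerance on the given input.

```python
import math


def round_to_bits(x, bits):
    """Round x to `bits` significant binary digits (exponent range not limited)."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)


def dot_emulated(xs, ys, bits):
    """Dot product in which inputs and every intermediate are rounded to `bits` bits."""
    acc = 0.0
    for x, y in zip(xs, ys):
        prod = round_to_bits(round_to_bits(x, bits) * round_to_bits(y, bits), bits)
        acc = round_to_bits(acc + prod, bits)
    return acc


def tune_precision(kernel, reference, rel_tol, max_bits=53):
    """Return the first significand width whose relative error is within rel_tol.

    `kernel(bits)` runs the computation at the emulated width. The error is not
    guaranteed to decrease monotonically with width, so the result holds only
    for the input that was tested.
    """
    for bits in range(2, max_bits + 1):
        if abs(kernel(bits) - reference) <= rel_tol * abs(reference):
            return bits
    return max_bits
```

A typical use is `tune_precision(lambda p: dot_emulated(xs, ys, p), reference, 1e-3)`, where `reference` is the result computed in full double precision. A production tool would also need to model exponent range, per-variable formats, and representative input sets.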
3) Sustainable HPC to Exascale and beyond.
OPRECOMP is building a kW-scale demonstrator for transprecision computing. The project has developed a testing environment that attaches PULP to an OpenPOWER-based system through CAPI. OPRECOMP has developed the library needed to establish this connection, alongside sample applications that form the baseline templates for porting OPRECOMP's micro-benchmarks. For early prototyping and debugging, an emulator of the kW system, which couples the PULP virtual platform to an OpenPOWER-based system, has been developed. On the PULP side, OPRECOMP has developed new functional units, processing elements and memory hierarchy structures that exploit transprecision characteristics.
4) Energy-neutral near-sensor processing.
OPRECOMP has been actively working on two IoT platforms (PULP and GAP8). These platforms already include some early transprecision support and will be made available to the full OPRECOMP consortium to develop and test benchmark applications. The project has also worked on alternative short bit-width floating-point representations with 8 and 16 bits, which have already been implemented and benchmarked in hardware. A further improvement is a complete floating-point unit that supports not only the basic ADD and MUL instructions but also DIV and SQRT operations.
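To give a feel for what 8- and 16-bit floating-point formats can represent, the sketch below decodes IEEE-754-style bit patterns of arbitrary width. The text above does not specify the bit layouts, so the 8-bit layout used in the usage note (1 sign bit, 5 exponent bits, 2 mantissa bits) is an assumption chosen for illustration; the 16-bit case uses the standard binary16 layout, which lets the decoder be checked against Python's `struct` module.

```python
import math
import struct


def decode_minifloat(code, exp_bits, man_bits):
    """Decode an IEEE-754-style bit pattern with the given field widths."""
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (code >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    man = code & ((1 << man_bits) - 1)
    if exp == (1 << exp_bits) - 1:       # all-ones exponent: infinity or NaN
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:                         # zero and subnormal numbers
        return sign * math.ldexp(man, 1 - bias - man_bits)
    # normal numbers: restore the implicit leading 1
    return sign * math.ldexp((1 << man_bits) | man, exp - bias - man_bits)


def max_finite(exp_bits, man_bits):
    """Largest finite value of the format (sign 0, largest non-special exponent)."""
    code = (((1 << exp_bits) - 2) << man_bits) | ((1 << man_bits) - 1)
    return decode_minifloat(code, exp_bits, man_bits)


def agrees_with_struct_binary16():
    """Cross-check the decoder against struct's binary16 ('e') for all 65536 codes."""
    for code in range(1 << 16):
        ref = struct.unpack("<e", struct.pack("<H", code))[0]
        val = decode_minifloat(code, 5, 10)
        if math.isnan(ref) != math.isnan(val):
            return False
        if not math.isnan(ref) and ref != val:
            return False
    return True
```

Under these assumptions, `max_finite(5, 10)` gives the binary16 maximum and `max_finite(5, 2)` gives the maximum of the assumed 8-bit layout: the two formats share the same exponent range, but the 8-bit one keeps only three significant bits, which is the precision versus cost trade-off that such formats expose.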
5) Pathfinding for disruptive technologies.
First investigations into transprecision memories, for example approximate DRAM and variation analysis of Resistive RAMs, have been carried out. OPRECOMP explored DRAM power-down modes in a full-system simulator and quantified their impact, which is critical for all kinds of DRAM subsystems. The project also investigated the refresh penalty of DRAMs in two flavors and used insights into vendor-specific DRAM architectures to optimize the error correction capabilities.
1) Establishing the first full framework based on transprecision. This implies (i) gaining a theoretical and practical understanding of the energy efficiency gain from lifting accuracy constraints in intermediate calculations (e.g. running on unreliable hardware); (ii) demonstrating generality, applicability and impact in many domains and applications (Big Data, Cognitive, HPC, etc.); (iii) establishing the principles of working with both deterministic and statistical approximation; (iv) controlling the approximation, bounding the error and guaranteeing the result; and (v) demonstrating the benefit of transprecision on a wide range of platforms (mW and kW pilots).
2) Establishing the first international transprecision computing community. This will be achieved through open dissemination of the results, tools, benchmarks, and code developed in the project. The community will include scientists in all the main project pillars, i.e.: Physical Foundations, Mathematical Theory, Disruptive Technology, Architectures & Circuits, Software Env. and Tools, Algorithms, and Applications.
OPRECOMP will build two fully operational transprecision computing systems: (i) a processor for near-sensor content understanding for IoT applications (few-mW power envelope); (ii) an HPC node coupling high-performance precise cores with an imprecise massively parallel accelerator (sub-kW power envelope). The two demonstrators will be used to show that uncompromised quality with a scalable, order-of-magnitude reduction in time- and energy-to-solution is reachable for intensive applications in the fields of Data Analytics, Simulations and Deep Learning.