Reconfigurable non-von-Neumann Accelerators

Final Report Summary - EXAFLOW (Reconfigurable non-von-Neumann Accelerators)

Computers are everywhere today, and their growing importance to our lives cannot be overstated. From social media to banking, from our toaster to our car, the presence of computers is ubiquitous. Yet despite the omnipresence of computers in our lives, their underlying compute model has not changed much since it was invented by the great mathematician and physicist John von Neumann in the 1940s.

The von Neumann computing model decomposes a computation into separate operations (e.g. addition, subtraction, multiplication) and executes them one at a time. Intermediate values calculated by these operations are stored in dedicated hardware storage constructs, which serve as inter-operation communication channels. As a result, the von Neumann model is very inefficient when it comes to energy consumption. In fact, modern von Neumann processors spend a mere 5% of their energy on pure computation. The rest of the energy is spent on managing the computation through control circuitry and data transfers. This puts the energy efficiency of modern processors on par with that of incandescent light bulbs, which have been outlawed in most of the western world due to their energy waste.

Today, general-purpose graphics processing units (GPGPUs) are considered the most energy-efficient von Neumann processors (in terms of operations per Watt of power, or OPs/Watt). These massively parallel, high-throughput processors are growing in popularity even though they still suffer from the power inefficiencies of the von Neumann model: they must repeatedly fetch and decode each instruction, and must use explicit storage to communicate intermediate results between instructions.

Interestingly, the same factors that adversely affect the energy efficiency of GPGPUs are obviated in coarse-grain reconfigurable arrays (CGRAs). These compute elements comprise a fabric of simple functional units that compute basic arithmetic operations. The distributed nature of CGRAs lends itself to simplifying control circuitry, reducing long-distance data transfers, and eliminating the need for explicit storage for intermediate values, which are the very overheads that burden von Neumann architectures. In a CGRA, the compute operations (nodes in the dataflow graph) are statically mapped to functional units, and an interconnect is configured to transfer values between functional units based on the graph's connectivity. Compared to von Neumann architectures, the distributed control and static instruction mapping obviate the instruction pipeline, and direct communication between functional units obviates the centralized register file.
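The static mapping described above can be illustrated with a small software model. The following is only a minimal sketch, not the project's architecture: the `FunctionalUnit` class, the `run` loop, the `"out"` sink name and the example expression (a + b) * (c - d) are all assumptions made for illustration. Each graph node is bound once to a unit, each unit forwards its result directly to the units it is wired to, and no instruction fetch or shared register file appears anywhere in the model.

```python
import operator
from dataclasses import dataclass, field


@dataclass
class FunctionalUnit:
    """Hypothetical functional unit: one fixed operation plus its wired destinations."""
    name: str
    op: callable
    dests: list  # (unit_name, input_port) pairs, standing in for the configured interconnect
    inputs: dict = field(default_factory=dict)

    def receive(self, port, value):
        """Latch an operand; fire and return the result once both operands are present."""
        self.inputs[port] = value
        if len(self.inputs) == 2:
            result = self.op(self.inputs[0], self.inputs[1])
            self.inputs = {}
            return result
        return None


def build_fabric():
    """Statically map the dataflow graph of (a + b) * (c - d) onto three units."""
    return {
        "add": FunctionalUnit("add", operator.add, [("mul", 0)]),
        "sub": FunctionalUnit("sub", operator.sub, [("mul", 1)]),
        "mul": FunctionalUnit("mul", operator.mul, [("out", 0)]),
    }


def run(fabric, tokens):
    """Move (unit_name, port, value) tokens along the wiring until they reach 'out'."""
    outputs = []
    pending = list(tokens)
    while pending:
        unit_name, port, value = pending.pop(0)
        if unit_name == "out":
            outputs.append(value)
            continue
        unit = fabric[unit_name]
        result = unit.receive(port, value)
        if result is not None:
            # The result goes straight to the consumer units, never to shared storage.
            pending.extend((dest, dport, result) for dest, dport in unit.dests)
    return outputs


def evaluate(a, b, c, d):
    """Inject the four operands and return (a + b) * (c - d)."""
    tokens = [("add", 0, a), ("add", 1, b), ("sub", 0, c), ("sub", 1, d)]
    return run(build_fabric(), tokens)[0]
```

In this model the mapping in `build_fabric` is fixed before any data arrives, so evaluating the expression only involves moving values between units, which is the property the paragraph above attributes to CGRAs.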

In this research we have designed a high-throughput, massively multithreaded CGRA processor as a power-efficient alternative to GPGPUs. Our research group has presented a novel, high-throughput CGRA execution model and architecture that outperforms contemporary GPGPUs by a factor of 2x-20x while consuming half the energy per operation (that is, delivering twice the ops/Joule). The multithreaded coarse-grain reconfigurable array (MT-CGRA) execution and architectural model combines a CGRA with a dynamic dataflow execution model to accelerate the execution throughput of massively thread-parallel code. Notably, this research track differs distinctly from past research on CGRAs, which focused exclusively on single-thread performance.
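One common way to realize dynamic dataflow is to tag every value with the identifier of the thread it belongs to, so that many threads can flow through the same configured fabric at once. The sketch below assumes that tagged-token scheme; the report does not state that MT-CGRA uses exactly this mechanism, and the `TaggedUnit` class, the `run_threads` function and the example expression (a + b) * (c - d) are hypothetical names chosen for illustration.

```python
import operator
from collections import defaultdict


class TaggedUnit:
    """Hypothetical functional unit that matches operands per thread id (the tag)."""

    def __init__(self, op, dests):
        self.op = op
        self.dests = dests  # (unit_name, input_port) pairs, fixed at configuration time
        self.waiting = defaultdict(dict)  # thread id -> {port: value}

    def receive(self, tid, port, value):
        """Store an operand under its tag; fire when that thread has both operands."""
        slots = self.waiting[tid]
        slots[port] = value
        if len(slots) == 2:
            del self.waiting[tid]
            return self.op(slots[0], slots[1])
        return None


def run_threads(thread_inputs):
    """Compute (a + b) * (c - d) for every thread on one shared, statically mapped fabric.

    thread_inputs maps a thread id to its (a, b, c, d) operands.
    """
    fabric = {
        "add": TaggedUnit(operator.add, [("mul", 0)]),
        "sub": TaggedUnit(operator.sub, [("mul", 1)]),
        "mul": TaggedUnit(operator.mul, [("out", 0)]),
    }
    pending = []
    # Interleave operands from different threads to mimic tokens arriving out of order.
    for tid, (a, b, c, d) in thread_inputs.items():
        pending += [("add", tid, 0, a), ("sub", tid, 0, c)]
    for tid, (a, b, c, d) in thread_inputs.items():
        pending += [("add", tid, 1, b), ("sub", tid, 1, d)]

    results = {}
    while pending:
        unit_name, tid, port, value = pending.pop(0)
        if unit_name == "out":
            results[tid] = value
            continue
        result = fabric[unit_name].receive(tid, port, value)
        if result is not None:
            dests = fabric[unit_name].dests
            pending.extend((dest, tid, dport, result) for dest, dport in dests)
    return results
```

Because operands are matched by tag rather than by arrival order, a unit can serve whichever thread has its inputs ready. This is one way idle functional units could be kept busy across threads, under the assumptions stated above.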

Building on the initial compute model and processor design, we furthered our study of massively multithreaded CGRAs by addressing some key challenges in this novel compute model. Concretely, we studied the design of a cache and memory system for high-throughput CGRAs, whose memory workloads challenge existing high-throughput cache designs. In addition, as the performance and power efficiency of CGRAs are highly affected by the utilization of their fabric of functional units, we have studied methods to extend the new execution model in order to maximize the utilization, and thereby the performance and power efficiency, of the underlying CGRA.

This research carries immense societal significance, as the implications of a programmable, high-performance processor that is 2-20x faster than the state of the art while consuming half the energy per operation are substantial. Human development relies on major leaps in computing technology, which can facilitate new pathways that impact our lives, from artificial intelligence that assists us in our daily lives to pharmaceutical research that extends them. The massive performance and energy-efficiency improvements enabled by this research offer both academia and industry a new vehicle to drive computing technology forward.