CORDIS - EU-supported research results

CoCoUnit: An Energy-Efficient Processing Unit for Cognitive Computing

Periodic Reporting for period 3 - CoCoUnit (CoCoUnit: An Energy-Efficient Processing Unit for Cognitive Computing)

Reporting period: 2022-09-01 to 2024-02-29

There is fast-growing interest in extending the capabilities of many systems around us with cognitive functions such as speech recognition, machine translation, speech synthesis, image classification or object recognition that will replace, extend and/or enhance human tasks in all kinds of environments (work, entertainment, transportation, health care, etc.). The CoCoUnit project is investigating the design of new computing system architectures that are highly energy-efficient, especially for systems that make intensive use of these cognitive functionalities.

We follow a disruptive approach by researching unconventional architectures that dramatically improve energy efficiency while delivering substantial performance gains. These platforms use various types of units specialized for particular domains, and we place special emphasis on brain-inspired architectures (e.g. neural networks) and graphics processors due to their potential to exploit massive parallelism and their high energy efficiency. We propose extensions to existing architectures combined with novel accelerators and functional units. The end goal of this project is to devise new platforms that enable new user experiences in cognitive computing and computational intelligence on mobile devices, embedded systems, servers and data centers.
Some of the most relevant results to date include:

The design of an accelerator for neural networks that includes novel techniques to reduce energy consumption, such as computation reuse, pruning of neurons and connections, dynamic selection of the precision used in calculations, increasing locality in memory accesses, and a new workload scheduling mechanism for recurrent neural networks.

The design of a “system-on-chip” that includes a general-purpose processor and various accelerators for automatic speech recognition, and achieves real-time operation with very low energy consumption.

The design of a new unit to improve the performance of graphics processors for graph algorithms by reordering, merging and filtering out redundant memory accesses and related activity.

A microarchitecture for graphics processors based on exploiting coherence between successive frames to reduce computations and substantially improve their energy efficiency, as well as a new organization of its memory hierarchy to better exploit locality in accesses.

A detailed characterization of the performance and energy consumption of computing systems for autonomous vehicles, and the proposal of an accelerator to optimize one of their main bottlenecks: simultaneous localization and mapping.

A novel hardware approach to detect occluded fragments in graphics processors and remove them from the computation pipeline.

A new microarchitecture extension to graphics processors that predicts occluded primitives in the early stages of the processing pipeline and avoids their associated memory accesses and computations.

A programmable accelerator for automatic speech recognition targeted to edge devices that can be easily adapted to implement alternative/future models while providing high performance and low energy consumption.

An extension to TVM, a compiler and auto-tuner for DNN systems, to extract meaningful hardware-related features that improve the quality of the representation of the search space and the accuracy of its prediction during the auto-tuning process.

A novel methodology for efficient simulation of graphics workloads that is capable of accurately characterizing an entire video sequence by using a small subset of selected frames, which substantially reduces simulation time.

A new workload scheduler and microarchitecture change for GPUs that improves texture cache locality and minimizes workload imbalance penalties.

A novel high-performance and energy-efficient architecture extension to exploit Sliding Window Processing in conventional CPU cores, and its detailed evaluation for autonomous driving workloads.
This project advances the state-of-the-art in a number of ways.

(1) We are the first to identify the potential of reusing computations in DNNs in a number of innovative ways to avoid ineffectual activity. We also propose a novel dynamic adaptive quantization scheme for RNNs to reduce compute and memory activity.
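
As an illustration of the computation-reuse idea, the following Python sketch (purely illustrative; function names, shapes and the quantization scheme are hypothetical, not the proposed hardware) shows how quantizing activations to a few levels lets each weight row be multiplied once per level rather than once per input element:

```python
def reuse_dot(weights, activations, levels=16):
    """Dot products of each weight row with activations quantized to a
    few discrete levels, multiplying once per level instead of once per
    input element."""
    a_min, a_max = min(activations), max(activations)
    step = (a_max - a_min) / (levels - 1) if a_max > a_min else 1.0
    idx = [round((a - a_min) / step) for a in activations]    # level index per input
    level_values = [a_min + step * k for k in range(levels)]
    out = []
    for row in weights:
        sums = [0.0] * levels
        for w, k in zip(row, idx):       # group-and-sum weights by activation level
            sums[k] += w
        # One multiply per level, regardless of how many inputs share it.
        out.append(sum(s * v for s, v in zip(sums, level_values)))
    return out
```

In hardware, the analogous saving is that repeated operand values allow products to be memoized or grouped rather than re-executed for every input.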

(2) We debunk conventional neuron pruning schemes by showing that they perform close to a random policy, where the only parameter that matters is the degree of pruning. We then propose a highly effective pruning scheme that avoids the huge overhead of traditional schemes.
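
For context, the baseline such schemes are measured against is plain magnitude pruning driven only by the pruning degree; a minimal Python sketch (illustrative only, not the project's proposed scheme):

```python
def prune_by_magnitude(weights, degree):
    """Zero out the fraction `degree` of weights with smallest magnitude.

    The single parameter `degree` controls sparsity, matching the
    observation that the degree of pruning is what matters most.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * degree)                 # number of weights to drop
    threshold = flat[k - 1] if k > 0 else -1.0  # magnitude cutoff
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]
```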

(3) We take a disruptive approach to deal with the extremely low efficiency of graphics processors on graph-based algorithms. Our approach extends the graphics processor with an additional programmable unit that is responsible for optimizing the locality of memory requests, and for identifying and removing much of the redundant activity that typical graph-based algorithms tend to generate.
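
The core idea of reordering and deduplicating requests can be sketched as follows (a hypothetical software analogue with an assumed 64-byte cache line, not the unit's actual design):

```python
def coalesce_requests(addresses, line_size=64):
    """Map raw byte addresses to unique, sorted cache-line requests.

    Deduplication removes redundant accesses to the same line, and
    sorting restores locality for the memory system.
    """
    lines = {addr // line_size for addr in addresses}   # one entry per line
    return sorted(line * line_size for line in lines)   # reordered line addresses
```

Irregular graph traversals generate many scattered accesses that hit the same few cache lines; filtering them before they reach memory is what removes the redundant activity.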

(4) We are the first to implement an optimal replacement policy for a cache memory. We show that such a policy can be implemented at moderate cost and provides significant benefits for the tile cache of a graphics processor. We also develop a dynamic mapping scheme for the texture cache.
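
The reference point for optimal replacement is Belady's MIN policy: evict the block whose next use is farthest in the future. The sketch below simulates it in Python (illustrative only; a real implementation cannot scan the future trace and must approximate it in hardware):

```python
def belady_misses(trace, capacity):
    """Count misses of Belady's optimal (MIN) replacement policy on an
    access trace of hashable block identifiers."""
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue                      # hit: no replacement needed
        misses += 1
        if len(cache) >= capacity:
            def next_use(b):
                # Distance to the next access of b (infinite if never reused).
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float('inf')
            cache.remove(max(cache, key=next_use))  # evict farthest-reused block
        cache.add(block)
    return misses
```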

(5) We leverage the temporal coherence between consecutive frames of graphics workloads to devise several mechanisms that reduce the number of computations without compromising the quality of the rendered image.
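
The underlying principle can be sketched as a tile-level reuse check (hypothetical names and data layout; the actual mechanisms operate at the microarchitecture level):

```python
def render_with_coherence(frame_tiles, prev_tiles, prev_results, shade):
    """Recompute only the tiles whose inputs changed since the previous
    frame; unchanged tiles reuse the cached result.

    frame_tiles / prev_tiles: per-tile inputs (comparable values);
    shade: the expensive per-tile rendering computation.
    """
    results, reused = [], 0
    for i, tile in enumerate(frame_tiles):
        if prev_tiles is not None and prev_tiles[i] == tile:
            results.append(prev_results[i])   # temporal reuse: skip shading
            reused += 1
        else:
            results.append(shade(tile))       # inputs changed: recompute
    return results, reused
```

Since consecutive frames are typically near-identical, most tiles take the reuse path, which is where the energy savings come from.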

(6) We propose a novel platform for automatic speech recognition that achieves human quality and can be deployed on edge devices that have very stringent power and cost budgets. The proposed platform leverages several accelerators but at the same time is programmable with a simple API, which makes it suitable for a variety of current and future algorithms that are likely to appear in this rapidly evolving area.

(7) We extend the TVM compiler and auto-tuner framework with a new approach to represent the search space that increases the effectiveness of its auto-tuning function.

(8) We design a novel CPU microarchitecture and associated ISA extensions to accelerate the processing of sliding windows, a very common programming pattern in autonomous driving and many other applications that make use of image processing.
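
Sliding-window processing is attractive to accelerate because consecutive windows overlap heavily, so each result can be derived incrementally from the previous one. A 1-D Python sketch of that reuse (illustrative of the pattern, not the proposed ISA extensions):

```python
def sliding_sums(signal, width):
    """Sum of every contiguous window of `width` elements, computed
    incrementally: each window reuses the previous sum instead of
    re-adding all `width` elements."""
    if len(signal) < width:
        return []
    out = [sum(signal[:width])]          # first window: full computation
    for i in range(1, len(signal) - width + 1):
        # Subtract the element leaving the window, add the one entering it.
        out.append(out[-1] - signal[i - 1] + signal[i + width - 1])
    return out
```

The same overlap exists in 2-D image kernels, which is what makes hardware support for the pattern pay off in vision-heavy workloads such as autonomous driving.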

For the remainder of this project, we plan to investigate the use of near-data processing technologies to improve the energy efficiency of cognitive computing systems; we will continue to improve graphics processors and general-purpose processors with special emphasis on the organization and management of the memory system; and we will continue our research on novel platforms for self-driving cars.