CORDIS - EU research results
DESIGN OF EXTREMELY ENERGY-EFFICIENT MULTI-CORE PROCESSOR IN NANOSCALE CMOS FOR MEDIA PROCESSING IN PORTABLE DEVICES

Final Report Summary - RAVEN10 (DESIGN OF EXTREMELY ENERGY-EFFICIENT MULTI-CORE PROCESSOR IN NANOSCALE CMOS FOR MEDIA PROCESSING IN PORTABLE DEVICES)

A. Brief overview of research project

The goal of this project is to achieve extreme energy efficiency in new, highly parallel many-core architectures that exploit the properties of error-tolerant applications to tolerate variability in small technology nodes (e.g. sub-28 nm processes). To achieve this, a unified approach is proposed across the layers of the design stack: software, architecture and circuits. The architecture is named RAVEN (Resilient Architecture with Vector-thread ExecutioN). Each Raven core is a single vector-thread lane with an independent control processor, and hundreds of such lanes will be designed to fit in the area budget. The delay and performance of each core will be reported to the control processor so that it can dynamically schedule tasks onto cores with the goal of achieving extreme energy efficiency. Dynamic voltage and frequency scaling (DVFS) will be applied independently to each core to maximize the energy efficiency of the processor. Hardware and delay performance monitors will be embedded in the logic so that process variability can be managed through immediate dynamic reconfiguration.

B. Major accomplishments achieved

Fig. 1 DVFS for manycore processor

Fig. 1 presents a block diagram of a four-core processor in which each core has its own DVFS scheme. The supply voltage and clock frequency are controlled independently for each core, so that each core operates at its minimum-energy point for the application it is executing at any given time. The following subtasks have been accomplished, with the final goal of integrating multiple cores on a single die with a complete DVFS scheme.

1. Implementation of fully-integrated reconfigurable switched capacitor circuits

As the number of cores grows, fine-grained DVFS schemes become prohibitively challenging to implement using off-chip inductor-based converters. In contrast, the reconfigurable switched-capacitor (SC) DC-DC converters chosen in this project can be completely integrated, while offering reduced switch V-A stress and reduced overshoot. Their primary disadvantage is the inherent switched-capacitor loss caused by voltage ripple across the flying capacitors, combined with the fact that a conventional digital system must be clocked for the minimum supply voltage. However, we show that by adapting the clock waveform to the rippling supply voltage through adaptive clock schemes, the voltage ripple can be turned into additional performance, resulting in conversion efficiencies of 90% across a wide range of conversion ratios.
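The benefit of adaptive clocking described above can be illustrated with a small numerical sketch. It is not the project's model: the alpha-power-law delay model, its constants (`vt`, `alpha`, `k`) and the linear shape of the ripple are all assumptions chosen for illustration. The sketch compares a fixed clock, which must be set for the lowest voltage in the ripple, with a clock that tracks the instantaneous supply.

```python
def max_frequency(v, vt=0.35, alpha=1.5, k=2.0e9):
    """Maximum clock frequency at supply v (alpha-power law; constants are assumed)."""
    return k * (v - vt) ** alpha / v


def adaptive_clock_gain(v_min, v_max, samples=1000):
    """Ratio of the average adaptive-clock frequency to the fixed-clock frequency.

    The supply is assumed to ramp linearly between v_min and v_max over one
    ripple period. A fixed clock is limited by v_min for the whole period,
    while an adaptive clock follows the instantaneous voltage.
    """
    step = (v_max - v_min) / (samples - 1)
    voltages = [v_min + i * step for i in range(samples)]
    f_adaptive = sum(max_frequency(v) for v in voltages) / samples
    f_fixed = max_frequency(v_min)
    return f_adaptive / f_fixed


if __name__ == "__main__":
    # Hypothetical 100 mV ripple below a 1 V peak.
    print(f"throughput gain: {adaptive_clock_gain(0.9, 1.0):.3f}x")
```

Under these assumptions the gain is always at least 1, and it grows with the ripple amplitude, which is the intuition behind tolerating a larger ripple rather than suppressing it.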

The proposed SC DC-DC converter can be reconfigured for four different voltage domains to enable fine-grained voltage scaling: from a nominal 1 V for computationally intensive applications down to a 0.5 V average voltage for near-subthreshold operation. The key performance tradeoff in fully-integrated SC DC-DC converters is power density versus efficiency: the higher the power the converter must deliver within a given area, the lower its efficiency will be. In traditional systems where low output ripple is required, this tradeoff is set by the need to balance four dominant losses in the switched-capacitor circuit: the non-ideality of the switches, the power dissipated for switching, the parasitic bottom-plate capacitance loss, and the loss in the SC converter's intrinsic output impedance due to charging and discharging the flying capacitor. By eliminating the loss associated with the output resistance of the converter (i.e. by allowing ripple at the output), the converter can operate at a substantially lower switching frequency and a higher overall efficiency.
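To make the cost of the output-impedance loss concrete, the sketch below uses the textbook bound for a conventional low-ripple SC converter: with an ideal conversion ratio M, the no-load output is M·V_in, and regulating down to a lower V_out caps efficiency at V_out / (M·V_in). The input voltage and the four ratios are illustrative assumptions; the report does not list the actual ratios of the RAVEN converter.

```python
from fractions import Fraction

V_IN = Fraction(1)  # assumed input voltage in volts (hypothetical)
RATIOS = [Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(1)]  # assumed ratios


def pick_ratio(v_target, v_in=V_IN, ratios=RATIOS):
    """Return (ratio, efficiency_bound) for the smallest ratio that reaches v_target.

    The bound v_target / (ratio * v_in) accounts only for the intrinsic
    output-impedance loss; switch, switching and bottom-plate losses are ignored.
    """
    for m in sorted(ratios):
        no_load_output = m * v_in
        if no_load_output >= v_target:
            return m, v_target / no_load_output
    raise ValueError("target voltage is above the highest available output")


if __name__ == "__main__":
    ratio, bound = pick_ratio(Fraction(3, 5))  # 0.6 V target
    print(f"ratio {ratio}, efficiency bound {float(bound):.0%}")
```

In this simplified picture, a target that falls between two ratios pays an efficiency penalty proportional to the gap, which is why a finite set of conversion ratios limits conventional operation.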

2. Energy-delay optimization at system level

Traditional system energy analysis assumes a fixed supply voltage. We analyze the energy of the many-core system when it is operated under a variable supply voltage. The analysis allows us to perform a global optimization that finds the minimum-energy operating point of a processor core for a desired application performance level. To overcome the limitation of the finite number of conversion ratios in an SC DC-DC converter, we introduce a combined technique that exploits the body-bias voltage tuning available in Fully Depleted Silicon-On-Insulator (FDSOI) technology together with DC-DC state hopping. This technique reduces the energy per core by up to 25% compared to DVFS schemes with traditional on-chip SC voltage regulators.
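The state-hopping part of this idea can be sketched as follows. This is a toy model, not the project's analysis: the alpha-power-law frequency model, the energy model, every constant and the list of voltage levels are assumptions, and body-bias tuning is not modelled at all. The sketch time-shares two adjacent discrete supply levels so that the average throughput matches a target that lies between them.

```python
# Toy model: every constant below is an illustrative assumption, not a measured value.
VT = 0.35        # threshold voltage [V]
ALPHA = 1.5      # alpha-power-law exponent
K = 2.0e9        # frequency scale factor [Hz]
C_EFF = 1.0e-9   # switched capacitance per operation [F]
I_LEAK = 5.0e-3  # leakage current [A]


def frequency(v):
    """Maximum clock frequency at supply v (alpha-power law)."""
    return K * (v - VT) ** ALPHA / v


def energy_per_op(v):
    """Dynamic plus leakage energy for one operation at supply v."""
    return C_EFF * v ** 2 + I_LEAK * v / frequency(v)


def hop(levels, f_target):
    """Return (v_lo, v_hi, duty_hi, avg_energy) for hopping between adjacent levels.

    duty_hi is the fraction of time spent at v_hi so that the time-averaged
    frequency equals f_target.
    """
    levels = sorted(levels)
    if f_target <= frequency(levels[0]):
        # Lowest level is already fast enough; idle time is assumed to cost nothing.
        v = levels[0]
        return v, v, 1.0, energy_per_op(v)
    for v_lo, v_hi in zip(levels, levels[1:]):
        f_lo, f_hi = frequency(v_lo), frequency(v_hi)
        if f_lo <= f_target <= f_hi:
            duty = (f_target - f_lo) / (f_hi - f_lo)
            # Operations completed at each level scale with time * frequency.
            e_avg = (duty * f_hi * energy_per_op(v_hi)
                     + (1 - duty) * f_lo * energy_per_op(v_lo)) / f_target
            return v_lo, v_hi, duty, e_avg
    raise ValueError("target frequency exceeds the fastest level")


if __name__ == "__main__":
    levels = [0.5, 0.67, 0.75, 1.0]  # hypothetical converter output levels [V]
    target = 0.5 * (frequency(0.75) + frequency(1.0))
    v_lo, v_hi, duty, e_avg = hop(levels, target)
    saving = 1 - e_avg / energy_per_op(v_hi)
    print(f"hop {v_lo} V <-> {v_hi} V, {duty:.0%} of time high, "
          f"{saving:.0%} less energy than staying at {v_hi} V")
```

Without hopping, a core whose target falls between two levels would have to run at the higher level all the time; in this model the hopped average energy per operation always lies between the energies of the two levels.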

3. Real on-board measurements

The chip, featuring a complete DVFS scheme implemented on one core, was fabricated in 28 nm FDSOI and is currently under test. The testing platform consists of a host PC and two boards: a PCB carrying the Raven chip and an FPGA board that handles communication between the PCB and the PC. We designed the PCB to enable automatic measurements of current (power) and to allow voltage and frequency settings to be applied directly from the PC console using the I2C protocol. The FPGA board handles two types of communication: one for the I2C settings, and the other for the SPI protocol that is implemented on the Raven chip and is used to send the DC-DC configuration settings and reset.

C. Expected final results and their potential impact and use

There is enormous interest in interactive computing. Portable devices (mobile phones, internet tablets and notebooks) are dramatically changing the way we interact with them. Video and speech, as well as other user interface schemes, can be efficiently processed on highly parallel processors. However, one-time fabrication costs for state-of-the-art CMOS designs are now several million euros, and the total design costs of modern chips can easily reach tens of millions of euros. These costs are expected to continue rising in the future. In this context, programmable and/or reconfigurable processors that are not tailored to a single application become increasingly attractive. The proposed energy-efficient data-parallel processor is thus linked to industrial interests and will have not only a scientific but also a significant economic impact. The project is leading to the establishment of new methodologies, both in electronic circuit design and in software for handling data-parallel applications. The expected final result is a completely integrated four-core system with per-core DVFS and on-chip dynamic reconfiguration control. The expected energy savings with this scheme range between 5% and 25% per device. Given that the collective annual electricity consumption of the iPhone 5 units sold within 12 months is estimated to be equivalent to the annual electricity usage of 54,000 US households, 5% less energy per device would save the annual consumption of 2,700 households. And that is just for one smartphone model over one year.
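The household estimate above is a direct proportion, and the same arithmetic can be applied to both ends of the expected savings range. The short sketch below uses only the figures quoted in the text; the function name is illustrative.

```python
HOUSEHOLDS_EQUIVALENT = 54_000  # household-equivalent consumption quoted in the text


def households_saved(saving_percent, baseline=HOUSEHOLDS_EQUIVALENT):
    """Household-equivalents saved if every device uses saving_percent less energy."""
    return baseline * saving_percent / 100


if __name__ == "__main__":
    for pct in (5, 25):  # lower and upper ends of the expected savings range
        print(f"{pct}% saving -> {households_saved(pct):,.0f} households")
```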