Open Source FPGA Accelerator & Hardware Software Codesign Toolset for CUDA Kernels

Final Report Summary - FASTCUDA (Open Source FPGA Accelerator & Hardware Software Codesign Toolset for CUDA Kernels)

Executive Summary:
Scientific applications in areas such as graphics, biological modeling and molecular dynamics are usually highly parallel and can benefit from specialized hardware to accelerate their execution. For this reason, highly parallel Graphics Processing Units (GPUs) have traditionally been favored over General Purpose Processors for running such applications. In the same way, FPGAs can potentially provide even higher speedups at lower power consumption than GPUs. However, their use is still limited since the path to porting an application onto FPGAs’ custom hardware is often prohibitively cumbersome. FASTCUDA facilitates this path by providing a novel methodology, architecture and toolset to automatically port and run already-parallelized algorithms onto reconfigurable hardware. For this purpose, the FASTCUDA methodology utilizes CUDA, a GPU programming language that exposes parallelism at the source-code level.

The FASTCUDA toolset splits, with minimal user intervention, an application's code into two parts: one that is compiled and executed as parallel software on an embedded multi-core, and another consisting of multiple special-purpose accelerators that are synthesized and implemented in hardware. A latest-generation low-power FPGA provides the processing power and the logic capacity to implement and execute both parts.

In particular, FASTCUDA is a design methodology and accompanying toolset that allows CUDA programs to be executed efficiently on a shared memory, multi-core CPU communicating with an FPGA-based accelerator. A multi-core processor, consisting of multiple embedded cores (configurable small processors), is used so as to run the host program serially and the SW CUDA kernels in parallel. Threads belonging to the same CUDA thread-block are executed by the same core. The HW CUDA kernels are partitioned into thread-blocks, and synthesized and implemented inside an “Accelerator” block. Each thread-block has a local private memory while the global shared memory can be accessed by any thread following the philosophy of the CUDA model.
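The execution model described above can be illustrated with a small software emulation. This is only a sketch under stated assumptions, not the FASTCUDA implementation: the names `vector_add_kernel`, `run_block` and `launch` are hypothetical, a Python thread pool stands in for the embedded cores, and the threads of a block are simply run one after another, which is only valid for kernels that contain no barrier synchronization between threads.

```python
from concurrent.futures import ThreadPoolExecutor


def vector_add_kernel(block_idx, thread_idx, block_dim, local_mem, global_mem):
    """Body of one CUDA-style thread: c[i] = a[i] + b[i]."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(global_mem["c"]):
        # Stage the result in the block's private memory, then publish it
        # to the global memory that every thread can access.
        local_mem[thread_idx] = global_mem["a"][i] + global_mem["b"][i]
        global_mem["c"][i] = local_mem[thread_idx]


def run_block(kernel, block_idx, block_dim, global_mem):
    """Run all threads of one thread-block on the same (emulated) core."""
    local_mem = [0] * block_dim  # private to this thread-block
    for thread_idx in range(block_dim):
        kernel(block_idx, thread_idx, block_dim, local_mem, global_mem)


def launch(kernel, grid_dim, block_dim, global_mem, num_cores=4):
    """Distribute thread-blocks over a pool of workers standing in for cores."""
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        futures = [
            pool.submit(run_block, kernel, b, block_dim, global_mem)
            for b in range(grid_dim)
        ]
        for f in futures:
            f.result()  # re-raise any exception from a block


n = 10
global_mem = {
    "a": list(range(n)),
    "b": [10 * x for x in range(n)],
    "c": [0] * n,
}
launch(vector_add_kernel, grid_dim=3, block_dim=4, global_mem=global_mem)
```

Each block writes to a disjoint slice of `c`, so the blocks can run on different workers without further synchronization; the bounds check handles the last, partially filled block.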

For our prototype version, we have used a Xilinx Virtex-6 FPGA with 500MB of external DDR memory on a Xilinx ML605 evaluation board, and the multi-core processor consists of an array of Xilinx MicroBlaze CPUs. However, real products designed with FASTCUDA may also use faster embedded processors such as the ARM Cortex-A9 MPCore.

Project Context and Objectives:
In recent years, an observable trend in High Performance Computing (HPC) architectures has been the inclusion of accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to improve the performance of scientific applications. Several applications, ranging from graphics to biological modeling, molecular dynamics, physics and others, have been successfully ported to GPUs, taking advantage of highly parallel hardware to accelerate their execution. Porting to GPUs, hard as it may be, requires only software skills to code the specific algorithm into parallel multi-threaded software. On the other hand, the path to FPGA development is notoriously more difficult, since porting an algorithm to custom hardware is less straightforward and the simulation-verification-debugging cycle can be many times longer. For this reason, even though FPGAs’ custom hardware can potentially provide higher speedups at lower power consumption than GPUs, GPU-based solutions dominate the scientific world.

FASTCUDA aims to bridge this gap by taking advantage of the software parallelization effort that has gone into porting scientific applications to GPUs and utilizing it to implement FPGA-based systems. FASTCUDA focuses on CUDA, a GPU architecture and programming model initially developed by Nvidia for its line of GPUs, and provides a novel methodology, architecture and toolset to automatically port and run CUDA programs onto FPGA hardware.

Execution starts with the CUDA host program running single-threaded on the host CPU. Whenever a CUDA kernel is invoked, the host CPU dispatches the execution of the kernel to an accelerator (a separate device) that supports parallel execution of multiple threads. Traditionally these are Nvidia’s GPUs or other multi-core platforms. However, we show that even higher acceleration, as well as lower power and energy consumption, can be obtained if a computationally intensive CUDA kernel is synthesized into hardware and mapped onto an FPGA for execution. Therefore, FASTCUDA employs a hybrid approach: it uses an FPGA-based accelerator for executing the time-critical CUDA kernels and a multi-core processor for executing the CUDA kernels that could not fit in the FPGA fabric.
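The hybrid dispatch described above can be pictured as a simple routing step in the host program. The following is a minimal sketch only: `make_dispatcher`, `fake_accelerator` and `fake_multicore` are hypothetical names introduced for illustration and are not part of the FASTCUDA API, and the two back ends are stand-ins that merely record where each kernel was sent.

```python
def make_dispatcher(hw_kernels, run_on_accelerator, run_on_multicore):
    """Return a function that routes each kernel launch to hardware or software."""
    hw_kernels = frozenset(hw_kernels)  # kernels assumed to be synthesized in the FPGA

    def dispatch(kernel_name, *args):
        if kernel_name in hw_kernels:
            return run_on_accelerator(kernel_name, *args)
        # Kernels that did not fit in the fabric run on the multi-core processor.
        return run_on_multicore(kernel_name, *args)

    return dispatch


trace = []  # records (target, kernel) pairs for inspection


def fake_accelerator(name, *args):
    trace.append(("hw", name))
    return sum(args)


def fake_multicore(name, *args):
    trace.append(("sw", name))
    return sum(args)


dispatch = make_dispatcher({"matmul"}, fake_accelerator, fake_multicore)
r1 = dispatch("matmul", 1, 2)      # routed to the accelerator stand-in
r2 = dispatch("histogram", 3, 4)   # routed to the multi-core stand-in
```

Keeping the routing decision in one place means the host code calls every kernel the same way, regardless of where the partitioning step placed it.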

FASTCUDA is a design methodology and accompanying toolset that allows CUDA programs to be executed efficiently on a shared memory, multi-core CPU communicating with an FPGA-based accelerator. A modern FPGA provides all required resources: multiple embedded micro-CPUs for the CUDA host program and the CUDA kernels that will be executed on the multi-core processor, as well as large logic capacity for the CUDA kernels that will be accelerated in hardware. Toward this end, FASTCUDA has not developed everything from scratch; instead, it has joined numerous on-going efforts in industry and academia to create a unified, efficient open-source framework.

The objectives of FASTCUDA were twofold:
1. create an innovative embedded system design flow by designing highly efficient components and by taking advantage of numerous open-source ongoing efforts in codesign of embedded systems, both at the academic and at the industrial level
2. enable an easier transition from research results to industrial exploitation, i.e. standardization of codesign usage

FASTCUDA has successfully defined the new design flow and has provided the related toolset to the open-source community. The objectives have been achieved by defining, implementing and disseminating a publicly available platform that takes as input a description of the system in the CUDA programming model and produces an efficient FPGA-based embedded design that executes certain CUDA kernels in software, while it implements the rest in hardware according to a hardware/software partitioning algorithm that has been developed throughout the project.

In order to fulfill the aforementioned objectives we have built the FASTCUDA platform which is comprised of the following sub-systems:
• A novel reconfigurable computing (RC) architecture composed of a multi-processor system, shared memory and reconfigurable fabric in order to run the multi-threaded CUDA applications.
• An advanced high-level synthesis tool which efficiently maps the coarse and fine grained parallelism exposed in CUDA kernels onto the reconfigurable fabric.
• A compiler framework in order to port the CUDA programming model to the FASTCUDA multi-processor environment.
• A design space exploration strategy based on profiling, user-driven block partitioning, and analysis by simulation, compilation and high-level synthesis of the quality of each point in the design space.
• A central on-chip processor that coordinates the execution of the CUDA kernels and executes the main code (referred to as host code in the CUDA programming model) of the CUDA application.

The FASTCUDA platform is relatively easy to use through a graphical user interface (GUI), which should help it gain wide acceptance in the embedded design community. Because the tool targets designers who program at a high level and for whom shortening design time is critical, a user-friendly environment is of major importance and can play an important role in the wide adoption of the tool.

Project Results:
FASTCUDA's main target was to derive a high level synthesis toolset in order to efficiently run a CUDA application on an FPGA-based hybrid platform which consists of a multi-core processor and an FPGA accelerator. Throughout the project several tools were developed. A brief description of the main results/foregrounds follows:

1) High Level Synthesis tool: A complete software package that takes as input a CUDA kernel, which describes a part of the application, and provides as output synthesizable multi-threaded SystemC code and RTL code that implement exactly the same functionality as the input.

2) CUDA to multi-threaded C Compiler: A complete software package that (a) takes as an input a CUDA kernel, which describes a part of the application and (b) provides as output a CPU-based code performing the exact same function.

3) Multi-core processor: A hardware package that provides a multi-core CPU platform customized for the executions of CUDA kernels.

4) Estimation tools: Software packages that, given a CUDA description of an application, provide performance estimation numbers.

5) Exploration tool: A complete software package that takes as input a description of an application in CUDA (including the parts that will be implemented both in hardware and in software) as well as the characteristics of the FPGA-based platform, and gives the designer the necessary performance and power estimations for various hardware-software partitioning alternatives, so as to allow him/her to choose the optimal underlying architecture.

6) SW-HW bridge and system API: A hardware package that provides the SW-HW bridge between the multi-core and the FPGA accelerator, and a software package that includes the SW-HW communication API library.

7) Since there was no available Xilinx IP core which could provide cache coherency for the FASTCUDA multi-core processor, FASTCUDA built its own HW blocks which provide cache coherency.

8) Numerous CUDA applications have been developed addressing different application domains from security to bioinformatics.
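To make the role of the exploration tool in item 5 more concrete, the sketch below enumerates hardware/software partitioning alternatives and reports estimated time, power and area for each. It is an illustration under loudly stated assumptions, not the partitioning algorithm developed in the project: the `KernelEstimate` fields, the `explore` function, the additive time and power model (kernels run one after another, each hardware kernel adds a fixed power cost) and all numbers are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class KernelEstimate:
    """Hypothetical per-kernel estimates, e.g. as produced by estimation tools."""
    name: str
    sw_time_ms: float   # estimated run time on the multi-core processor
    hw_time_ms: float   # estimated run time as a hardware accelerator
    hw_area: int        # estimated fabric usage if implemented in hardware
    hw_power_mw: float  # estimated extra power if implemented in hardware


def explore(kernels, area_budget, base_power_mw):
    """Enumerate every HW/SW split that fits the area budget.

    Returns (hw_kernel_names, total_time_ms, power_mw, area) tuples,
    sorted by estimated total time.
    """
    results = []
    for r in range(len(kernels) + 1):
        for hw_set in combinations(kernels, r):
            area = sum(k.hw_area for k in hw_set)
            if area > area_budget:
                continue  # this alternative does not fit in the fabric
            hw_names = {k.name for k in hw_set}
            # Assumed model: kernels execute sequentially, so times add up.
            time_ms = sum(
                k.hw_time_ms if k.name in hw_names else k.sw_time_ms
                for k in kernels
            )
            power_mw = base_power_mw + sum(k.hw_power_mw for k in hw_set)
            results.append((sorted(hw_names), time_ms, power_mw, area))
    return sorted(results, key=lambda item: item[1])


# Made-up estimates for three kernels, purely for illustration.
kernels = [
    KernelEstimate("aes", 120.0, 10.0, 40_000, 300.0),
    KernelEstimate("sha", 80.0, 20.0, 30_000, 200.0),
    KernelEstimate("align", 200.0, 25.0, 60_000, 500.0),
]
results = explore(kernels, area_budget=100_000, base_power_mw=1000.0)
for hw_names, time_ms, power_mw, area in results:
    print(hw_names, time_ms, power_mw, area)
```

Exhaustive enumeration is exponential in the number of kernels, so it is only practical for small applications; it is used here because it keeps the trade-off between time, power and area easy to inspect.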

Potential Impact:
The market for embedded systems is huge. Global semiconductor sales in 2013 will amount to $317.9 billion, up 4.9 percent from $302.9 billion in 2012. In parallel, the move to ESL design is accelerating, and the commoditization of numerous such tools, with the accompanying price pressures, is continuing. Electronic System Level (ESL) design and verification is an emerging electronic design methodology that focuses on the higher abstraction level, and it seems the most promising and active approach towards fast and efficient embedded systems’ design tools. The way that each EDA company adapts to ESL design is likely to cause some important changes in the EDA market dynamics. The EDA industry has recently seen a number of large ESL acquisitions, most notably that of Denali by Cadence, as well as CoWare, VaST and Virage Logic, which were acquired by Synopsys. By the year 2015, EDA will look a lot different than it does now. Platforms such as FASTCUDA that offer an advanced ESL design framework can provide a vital, reliable and lucrative solution in the expanding market of embedded electronics.

Europe is a recognized leader in embedded system design and applications. Embedded systems affect all market segments in a revolutionary manner and in all societal aspects, for example in the form of cellular phone networks and equipment, automotive electronics and avionics, while they are also impacting most of the European population. For example, it has been estimated that more than 90% of innovation and more than 50% of future car value will come from automotive electronics (source: DaimlerChrysler). In order to preserve and expand this leadership, Europe must learn how to create cheap and high-performance embedded electronic hardware, with faster time-to-market that will satisfy the needs of its citizens.

Of all these factors, design time is probably the one in which retaining leadership is most vital and, at the same time, most difficult due to higher labour costs. This can be achieved only by improving designer productivity, a key aspect of the high-level codesign approach proposed by this project.

A well-established design methodology is essential in all the scenarios that envision broad access to embedded electronics. While software can be used to rapidly add features and customize existing hardware platforms, it is clear that rapid derivative design for very low-power and/or high-performance applications can only be achieved by means of true codesign and efficient hardware synthesis algorithms. For example, while today’s wireless vision sensor network (WVSN) prototypes mostly rely on software for application customization, a (possibly reconfigurable) multi-core hardware solution can bring the power consumption down by several orders of magnitude, and hence bring battery life to several years, or enable much more sophisticated signal processing algorithms that can reduce radio power consumption further.

FASTCUDA delivers an integrated tool chain that responds to the needs of the industry for rapid design and prototyping of multi-core embedded systems; this is achieved by enabling the use of abstract, high-level descriptions of embedded systems in a widely used programming model (i.e. CUDA), which can significantly speed up design and verification time. Moreover, by enabling concurrent design, FASTCUDA promotes cost-efficient software and hardware integration and reduces software/hardware design and prototyping time, thus addressing two additional important needs of the embedded systems’ industry.

The most important advantage of the FASTCUDA approach is that it enables fast hardware/software co-development by automatically mapping a high-level functional model onto a specific architecture with given performance/area/power constraints and providing a cycle-accurate model of the hardware, thus allowing for very efficient concurrent hardware and software refinement. Moreover, the FASTCUDA framework provides a semantic model that enables different algorithms developed by industrial and academic teams to cooperate in order to deliver an effective complete codesign suite. FASTCUDA blends the advantages of two different accelerator technologies (GPGPUs and FPGAs) transparently to the designer, while it provides a novel open-source platform that can meet the requirements of today’s demanding market.

Therefore, the outcome of FASTCUDA is “(i) a technology for efficient resource management and design space exploration, (ii) a framework respecting trade-offs when co-developing hardware and software and (iii) an open-source tool that allows synthesis of embedded hardware from CUDA programming model”.

This speed-up of the development cycle is one of the major potential advantages of the CUDA approach. The sequentiality between hardware design, which traditionally requires months to reach the level at which drivers and other hardware-dependent software (HDS) components can be written, and the development of that software is one of the major causes of delays in embedded software projects. Errors introduced at this stage are a significant source of slips in schedule that, according to reports, affect between 50% and 80% of all embedded system designs. HDS is a relatively low-level component of the overall embedded system architecture. However, its design is particularly time consuming, for a number of reasons. First of all, its timing-dependent functionality (e.g. functionality that depends on the satisfaction of real-time constraints) often depends on the availability of cycle-accurate models of the hardware, because the performance of the HDS must be assessed directly at a cycle-by-cycle level (consider e.g. a device driver). Second, its low level of abstraction, coupled with its highly concurrent nature and the presence of hard-to-predict interrupt effects, makes it very hard to specify, implement and debug. If the output of codesign is of acceptable quality in terms of area and power consumption, then its synthesis and physical design can start right away. Otherwise, the synthesized model can be used as a golden specification that the manual RTL modules must comply with, in order to produce results that are compatible with the concurrent development of software.

FASTCUDA has participated in a number of liaison activities with other EU research projects through its participation in a number of events. FASTCUDA has been discussed with participants from other EU research projects through its presentation at the 15th Euromicro Conference on Digital System Design on September 5-8, 2012. Some liaison activities have also started with the HiPEAC NoE. The FASTCUDA project was presented at the HiPEAC Computing Systems Week in October 2012. Finally, the FASTCUDA platform was demonstrated and discussed with members from other EU projects at the University Booth at DATE in Grenoble on the 20th of March 2013. We have also prepared a paper which includes the evaluation results from FASTCUDA and we are planning to submit it to a scientific journal.

The dissemination and exploitation of the FASTCUDA results was a major target throughout the project. In particular FASTCUDA has built a marketplace (http://www.cudakernels.com) which will provide synthesizable cores built based on the FASTCUDA flow.

In order to further disseminate the FASTCUDA architecture and toolset, the FASTCUDA project offers most of the tools developed throughout the project to the open source community. The tools, as well as example designs, are published on SourceForge (http://sourceforge.net/projects/fastcuda/).

Moreover, we have developed numerous CUDA applications that can be further exploited by the SMEs addressing different application domains from security to bioinformatics. Those applications are in addition to the basic ones that have been described in the DoW and they are described in D6.4. The SME partners are planning to use the FASTCUDA toolset and the developed applications internally.

List of Websites:
The FASTCUDA web site (http://fastcuda.eu/) has been developed and regularly updated by TSI from the very start of the project. The website’s main objective is to disseminate FASTCUDA’s objectives and results as widely as possible throughout the community and, in parallel, to serve as the project’s repository.

Contact Details:
Project Coordinator
Name : Luis Redondo
Company : Ingenieria de Sistemas Intensivos en SW
Address : Calle Eloy Gonzalo, 27 Agrate Brianza, I-20041, Italy
Email : lredondo@inetsis.es

Project Technical Manager
Name : Iakovos Mavroidis
Research Institute : Telecommunication Systems Institute Technical University of Crete
Address : Kounoupidiana, GR73133, Chania, Greece
Email : iakovosmavro@gmail.com
