
Automatic code generation for Graphics Processing Units

Final Report Summary - AUTH-AUTOGPU (Automatic code generation for Graphics Processing Units)


Overview

We have successfully developed a platform, named AUToGPU, that serves multiple objectives:

* facilitating the design and prototyping of parallel algorithms for digital signal and image processing (DSIP), as well as for certain basic data processing operations,

* expediting the development cycle, especially on graphics processing units (GPUs),

* automating performance tuning and optimization, and

* adapting to the GPU hardware updates and keeping the software agile.

AUToGPU embodies a host of methodologies:

* high-level expression of parallel algorithms, based on a unified abstraction of building blocks and compositions for DSIP algorithms;

* a library of templates for basic operations and functions;

* a special-purpose compiler for automatic translation of high-level expressions into low-level executable code; and

* a suite of optimization algorithms and techniques for performance tuning.

In the rest of the report, we introduce the motivation, technical rationale, methodologies, and development of AUToGPU, and we demonstrate its effective use with a few typical DSIP algorithms.

GPU computing potential and challenges

Scientific and engineering discoveries have been advanced at historically unprecedented speed and scale by advances in computer technology. Extensive computational resources enable scientists and engineers to study ever larger data sets, simulate complex phenomena, and design systems at spatial and temporal scales that were previously infeasible, pushing the boundaries of knowledge and leading to discoveries and products that benefit humanity.

In the last 25 years, we have witnessed a dramatic change in the field of high performance computing, both in terms of performance gains and cost reductions, driven mainly by the great success of microprocessors. The scene has changed in the last decade with the appearance, and rise to the top of the performance charts, of graphics processing units (GPUs). GPUs now consist of thousands of very simple processing units specialized for parallel numerical computation. Although GPUs were originally designed for rendering life-like scenes in computer games, they are now utilized for high performance numerical computation.

Modern GPUs can achieve theoretical rates of several trillion floating point operations per second (TFLOPS); GPUs therefore offer the computational capabilities of supercomputers at affordable prices. However, there is a dire shortage of parallel codes that efficiently utilize thousands of computing cores. Moreover, the GPU parallel codes that do attain high performance often depend heavily on specific architectural features and do not port easily to different or newer GPU hardware, even from the same manufacturer.

The dependence of high performance on a specific GPU architecture, together with the short hardware lifetime caused by rapid product cycles, imposes a very heavy burden on the programmers responsible for code optimization and maintenance.

Not much has changed in the GPU programming field since the start of this project. The two competing leaders in GPU hardware, NVIDIA and AMD-ATI, still stand behind two different programming approaches for general-purpose GPU computing. NVIDIA puts all its support behind the proprietary CUDA programming language. AMD uses the open standard OpenCL (Open Computing Language), supported by an industry consortium. OpenCL is a framework based on the C99 programming language for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. The two high-level programming environments are similar, but incompatible with each other, and require considerable effort to use. Portland Group International (PGI) released a compiler that supports CUDA code generation from source code annotation directives; PGI was recently acquired by NVIDIA. Intel has also entered the field of compute accelerators with the Xeon Phi, but so far it has not achieved the same levels of performance as AMD and NVIDIA. The programming model of the Phi is more traditional, as it uses the long-standing standards POSIX threads, OpenMP, and MPI.
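For readers unfamiliar with the CUDA programming model referred to throughout this report, a minimal complete program looks as follows (a generic textbook example, not code from this project):

#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread computes one element of c = a + b.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the example short; production codes often
    // manage host/device copies explicitly.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Even in this trivial case, the programmer chooses a decomposition into blocks and threads; for realistic kernels, many more such choices interact with the hardware, which is the tuning burden AUToGPU targets.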

Methodologies

To alleviate the problem of parallel programming for GPUs, we introduced AUToGPU to generate high-performance CUDA implementations by semi-automatic means without overburdening the programmer. More specifically, with AUToGPU we introduced a methodology and system to extend the life cycle of optimized numerical codes on GPUs by

1) defining the means to express and prototype parallel algorithms for digital signal and image processing and certain basic data processing operations,

2) automating performance optimization through code generation, the use of code templates, and self-tuning algorithm libraries that iteratively explore algorithm variants, and

3) adapting to GPU hardware updates and keeping the software agile.

AUToGPU utilizes special-purpose compiler techniques and appropriately designed mathematical abstractions to realize the potential of high performance computation, by manipulating domain-specific mathematical structures to match them to a GPU architecture. A unified abstraction of the algorithms in high-level mathematical expression is processed in multiple stages. Rewriting rules transform the algorithm abstraction into equivalent variants using mathematical identities. The systematic symbolic exploration of algorithmic variants helps uncover CUDA implementations that may perform better on a given GPU for a range of problem sizes, as illustrated by the sketch below.
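As a concrete illustration of what such variant exploration produces, consider two mathematically equivalent CUDA kernels for the same one-dimensional convolution. The kernel names and the fixed filter radius are hypothetical, for illustration only, and are not the project's actual templates:

// Hypothetical illustration: two equivalent variants of a 1D convolution
// with a filter of radius RADIUS and zero padding at the borders.
#define RADIUS 8

// Variant A: each thread gathers its inputs directly from global memory.
__global__ void conv1d_naive(const float* in, const float* w, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int j = i + k;
        if (j >= 0 && j < n) acc += w[k + RADIUS] * in[j];
    }
    out[i] = acc;
}

// Variant B: the same arithmetic, but each block first stages its input
// tile plus halo in on-chip shared memory to reduce global-memory traffic.
// Launch with dynamic shared memory of (blockDim.x + 2*RADIUS) floats;
// assumes blockDim.x >= RADIUS.
__global__ void conv1d_tiled(const float* in, const float* w, float* out, int n)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + RADIUS;

    tile[t] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = i - RADIUS;
        int right = i + blockDim.x;
        tile[t - RADIUS]     = (left  >= 0) ? in[left]  : 0.0f;
        tile[t + blockDim.x] = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (i < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += w[k + RADIUS] * tile[t + k];
        out[i] = acc;
    }
}

Variant B trades extra index arithmetic and a synchronization for far fewer global-memory reads; on bandwidth-bound problems it usually wins, but for small filters or inputs Variant A can be faster. Which variant is best depends on the GPU generation and the problem size, and selecting among such variants by measurement is precisely what AUToGPU automates.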

Demonstration and deployment

The AUToGPU methodology was developed, and code scaffoldings were generated and put into use, in several projects and student theses, including engineering diploma theses and an engineering master's thesis.

A tool was developed that accepts as input an abstract computational kernel and applies a series of transformations to explore the memory access options available in the CUDA programming model. In addition, the tool measures the performance of the resulting CUDA kernels for a certain set of customization parameters in order to decide which leads to an optimal implementation. These techniques are described in the master's thesis by Evangelos Savvidis and the diploma thesis by Alexandros-Stavros Iliopoulos.
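The following self-contained sketch illustrates the measurement side of such a tool: it times a simple kernel under several candidate launch configurations using CUDA events and keeps the fastest. It is a hypothetical reconstruction for illustration, not code from the tool or the theses:

#include <cstdio>
#include <cuda_runtime.h>

// A simple kernel whose performance depends on the launch configuration.
__global__ void saxpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int candidates[] = {64, 128, 256, 512, 1024};
    const int numCandidates = sizeof(candidates) / sizeof(candidates[0]);
    int best = 0;
    float bestMs = 1e30f;

    for (int c = 0; c < numCandidates; ++c) {
        int threads = candidates[c];
        int blocks = (n + threads - 1) / threads;

        saxpy<<<blocks, threads>>>(2.0f, x, y, n);  // warm-up launch

        cudaEventRecord(start);
        saxpy<<<blocks, threads>>>(2.0f, x, y, n);  // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d threads/block: %.3f ms\n", threads, ms);
        if (ms < bestMs) { bestMs = ms; best = threads; }
    }
    printf("best configuration: %d threads/block\n", best);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x); cudaFree(y);
    return 0;
}

In the actual tool the search space is richer, covering memory placement and code-template parameters rather than just block size, but the measure-and-select loop is the same in spirit.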

The AUToGPU technologies were put into use for the continued development and release of the FLCC library, a collection of highly optimized functions for the calculation of one-, two- and three-dimensional convolutions and correlations, with and without local normalization. FLCC has been the motivating example for the AUToGPU project. The results are described in the diploma thesis of George Papamakarios and George Rizos.
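For reference, the local correlation coefficient computed by such functions is the standard windowed (Pearson) correlation; in one dimension, with signal f, template g, and window W (the textbook definition, not copied from the FLCC documentation):

\[
r(x) = \frac{\sum_{s \in W} \bigl(f(x+s) - \bar{f}_x\bigr)\bigl(g(s) - \bar{g}\bigr)}
{\sqrt{\sum_{s \in W} \bigl(f(x+s) - \bar{f}_x\bigr)^2}\,\sqrt{\sum_{s \in W} \bigl(g(s) - \bar{g}\bigr)^2}},
\qquad
\bar{f}_x = \frac{1}{|W|} \sum_{s \in W} f(x+s).
\]

Without local normalization, only the numerator (the plain sliding correlation) is computed; the denominator, which must be recomputed for every window position, is what makes the locally normalized case computationally demanding.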

The AUToGPU methodology was also used in the development of CUDA implementations of narrow directional steerable filters for the estimation of optical flow in video streams, as presented in the diploma thesis by Vasiliki Siakka.
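The efficiency of steerable filters rests on the steering property (Freeman and Adelson): the response at an arbitrary orientation \(\theta\) is a linear combination of the responses of a few fixed basis filters. For the first derivative of a Gaussian, for example,

\[
G_1^{\theta} = \cos\theta \; G_1^{0^\circ} + \sin\theta \; G_1^{90^\circ},
\]

so only the basis filters need to be convolved with each frame on the GPU; the response at any orientation then costs a cheap per-pixel linear combination.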

The principles of AUToGPU code variants are used in optimized implementations of machine learning primitives for nearest-neighbour finding and clustering of very high-dimensional and massive data in the research work of PhD candidate Nikos Sismanis. Similarly, Alexandros-Stavros Iliopoulos, now a PhD candidate at Duke University, has put derived components of this project into use in a pipeline for stitching multiple image snapshots with scarce overlap to generate large panoramic images.

In conclusion, the great success of NVIDIA and the GPU architecture, and their domination of the high performance market, is driving a paradigm shift among the large microprocessor companies. AMD has acquired the GPU company ATI, and Intel has entered the GPU accelerator market with the release of the Phi. We expect that tools like AUToGPU will serve a much larger domain of processors in the near future, because the GPU architecture will no longer be an oddity of the microprocessor design space but rather the norm.
