
Tensor Processing on FPGAs for Artificial Neural Networks

Periodic Reporting for period 1 - TPANN (Tensor Processing on FPGAs for Artificial Neural Networks)

Reporting period: 2017-05-01 to 2018-04-30

Artificial intelligence and, in particular, neural networks are successfully conquering more and more application domains. They help to improve our quality of life and to rid us of repetitive and tedious duties. Their applications range from noise-canceling hearing aids through machine translation to powerful image processing algorithms that detect and classify objects in real-time video. Albeit amazingly effective, the deployment of neural networks poses an enormous computational challenge, and acceleration by power-hungry GPU farms is the norm rather than the exception. However, neural networks have been shown to be extremely resilient to the quantization of the underlying computation to numerical values with harshly constrained ranges. Researchers working with programmable hardware, including the hosting research team of Michaela Blott at Xilinx Ireland, have demonstrated that even binary quantization, which leaves only two possible numerical values for each operand, can yield capable neural network implementations in some application domains.
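The hardware appeal of binary quantization can be illustrated with the well-known XNOR-popcount trick: once both operands of a dot product are restricted to {-1, +1}, the whole multiply-accumulate chain collapses into cheap bit operations. The following is a minimal sketch of that principle (the function names are illustrative, not taken from any Xilinx codebase):

```python
# Sketch: a binarized dot product via XNOR and popcount.
# Values in {-1, +1} are encoded as single bits (1 -> +1, 0 -> -1).

def binarize(values):
    """Encode a list of +/-1 values as an integer bit vector."""
    bits = 0
    for i, v in enumerate(values):
        if v == +1:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two n-element +/-1 vectors from their bit encodings."""
    # XNOR marks positions where the signs agree; popcount counts them.
    matches = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    # Each agreeing position contributes +1, each disagreeing one -1.
    return 2 * matches - n

a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
assert xnor_popcount_dot(binarize(a), binarize(b), 4) == \
    sum(x * y for x, y in zip(a, b))
```

In programmable logic, the XNOR and the popcount map directly onto lookup tables, which is what turns saved operand bits into the density and concurrency gains described below.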
The successful quantization of neural network inference is highly relevant as it allows the underlying arithmetic to be simplified. The platforms able to extract benefit from every single saved bit are programmable hardware devices such as those made by Xilinx. These reprogrammable physical electrical circuits translate simpler operations directly into greater operational density and concurrency. Quantization thus allows small, power-efficient devices to deploy capable neural networks. Their use becomes a green option and is enabled in more demanding application environments, such as embedded or remote contexts or cyber-physical systems.
The goal of TPANN was the rigorous optimization of neural network inference on programmable hardware devices. Particularly in the ubiquitous convolutional networks, it is the computation of a vast number of dot products that poses the critical challenge. The strong background in digital design and the specialization in computer arithmetic of the fellow, Thomas Preußer, were key in this effort. One illustrative result of the work was an object detection demo operating on a live video stream and running on a small embedded heterogeneous all-programmable device. The work also yielded two invention disclosures that are currently undergoing internal patent review.
The project work started with getting to know the neural network solution developed by the hosting research team at Xilinx Ireland, the FINN framework. Special attention was devoted to its C++ libraries used for the high-level synthesis (HLS) of inference engines on the programmable fabric developed by Xilinx. Basic background research was conducted on competing solutions in the neural network inference market that use different technologies, such as Google's application-specific 8-bit TPU chip and NVIDIA's GPU accelerator library cuDNN. A closer analysis was devoted to NVIDIA's NVDLA accelerator, which can easily be used on programmable hardware and was released as an open-source project in September 2017.
The conducted work included the design and characterization of highly optimized arithmetic kernels for various low-precision quantization schemes at different abstraction levels. These results allowed the team to identify the points in the system design space that offer the most interesting trade-offs between the hardware resource investment and the achieved network accuracy.
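The nature of this trade-off can be sketched in a few lines: quantizing both operands of a dot product to a chosen bit width and comparing the result against the full-precision value shows how accuracy is traded for narrower, cheaper hardware operands. This is a simplified illustration under the assumption of uniform symmetric quantization of values in [-1, 1]; the function names are hypothetical and not taken from the FINN libraries:

```python
# Sketch: how operand bit width trades accuracy for hardware cost.
# Assumes uniform symmetric quantization of inputs in [-1, 1].

def quantize(x, bits):
    """Quantize x to `bits` bits and return the dequantized value."""
    levels = (1 << (bits - 1)) - 1              # e.g. 4 bits -> [-7, 7]
    q = max(-levels, min(levels, round(x * levels)))
    return q / levels

def quantized_dot(a, b, bits):
    """Dot product with both operand vectors quantized to `bits` bits."""
    return sum(quantize(x, bits) * quantize(y, bits) for x, y in zip(a, b))

a = [0.9, -0.3, 0.5, 0.1]
b = [-0.2, 0.8, 0.4, 0.7]
exact = sum(x * y for x, y in zip(a, b))
for bits in (2, 4, 8):
    err = abs(quantized_dot(a, b, bits) - exact)
    print(f"{bits}-bit error: {err:.4f}")       # error shrinks with bit width
```

Sweeping such a bit-width parameter per layer, and measuring both the resulting network accuracy and the hardware cost of the narrower operands, is the kind of design-space exploration referred to above.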
The TPANN project produced a host of reference implementations, particularly for the dot-product computation that is so central to convolution. The diversity of implementations was motivated by the desire to fully exploit all the capabilities of Xilinx's modern all-programmable devices and by different application goals.

A software solution leveraging the NEON vector extension of the ARM CPU cores on Zynq devices helped to conquer the computational challenge of those layers in an object detection network that were sensitive to quantization below 8 bits. This implementation became part of a demo for object detection in live video using a network derived from the TinyYOLO topology. The demo was first publicly presented by Thomas at the FPL conference in 2017, went viral within the company thereafter, and has since been used for conferences and job fairs alike.

Another refurbishment was undertaken on the convolutional layer library backing the FINN framework, which is published open-source as part of the BNN-PYNQ project. These high-level C++ models implemented for the high-level synthesis flow were not further optimized for performance but made more flexible. The templated rewrite of the processing engine now allows the full customization of the operand quantization. This provides a quick way to explore and analyze the design space of an application and to derive a balanced implementation easily.

Finally, systolic dot-product engines were designed and evaluated at the very low structural level of the programmable hardware fabric. Thomas developed new techniques to map quantized multi-bit operations efficiently to the DSP compute cores found on Xilinx devices. The resulting reference implementation was adopted in the development processes of two Xilinx products. Two related invention disclosures are currently under internal review to decide between patent filing and immediate publication.
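The general principle behind packing several low-bitwidth operations onto one wide integer datapath can be sketched as follows: two small unsigned weights are placed in a single operand with enough guard bits that their partial products cannot overlap, one wide multiplication is performed, and both products are sliced back out. This is only an illustration of the principle for unsigned operands; the project's actual mapping to Xilinx DSP cores handles signed operands and further subtleties and remains under internal review:

```python
# Sketch: two small multiplies computed with one wide integer multiplier.
# Assumes unsigned W-bit weights and A-bit activations.

W = 4                # weight bit width
A = 4                # activation bit width
SHIFT = W + A        # a product needs W+A bits; place the 2nd weight above it

def packed_mul2(w0, w1, a):
    """Compute (w0*a, w1*a) with a single wide multiplication."""
    packed = w0 | (w1 << SHIFT)          # two weights share one operand
    wide = packed * a                    # one multiply yields both products
    p0 = wide & ((1 << SHIFT) - 1)       # low slice  = w0*a
    p1 = wide >> SHIFT                   # high slice = w1*a
    return p0, p1

assert packed_mul2(5, 11, 9) == (5 * 9, 11 * 9)
```

Because the wide multipliers inside DSP cores are a fixed resource, doubling (or better) the number of low-precision products extracted from each of them translates directly into higher inference throughput per device.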
Beyond the technical dimension, Thomas was able to develop his leadership skills over the course of the project. He supervised interns and coordinated publications and invention disclosures as lead author. He also represented Xilinx with invited talks at three conferences (FPL'17, FPGA'18, and DATE'18) and supported Xilinx outreach activities by attending the company booth at FPL'17 and by presenting to transition-year students during an in-house orientation week in spring 2018. Finally, he engaged in non-technical dissemination activities promoting the Marie Skłodowska-Curie instruments at an event organized by the Graduate Academy of TU Dresden.
The results of TPANN have been exploited (a) for building an object detection demo on live video that was first shown by Thomas at FPL'17 and has been used for conferences and fairs throughout the company since then, (b) for an invited keynote talk presented by Thomas at FPL'17, (c) for publications at ICCD'17 and DATE'18, and (d) for integration into two Xilinx products related to accelerated neural network inference. Two techniques devised in the course of the project are currently under internal review to decide between patent filing and immediate publication.
The project has produced two key advances beyond the state of the art: (a) the full exploitation of the diverse compute capabilities of a heterogeneous embedded platform for challenging live object detection, and (b) an extremely efficient mapping of low-bitwidth operations to wider integer datapaths. While the first advance was published at DATE'18, the second is still undergoing an internal patent review.