Objective 1
Six applications were selected and investigated to derive the operations (kernels) that could be performed in a CIM core, thereby reducing communication between CPU and memory and potentially also reducing energy consumption and improving overall performance. The applications were drawn from three domains: data analytics, signal processing, and machine learning. Simulations and comparative studies estimate that, depending on the application, the energy benefit is 5-10x compared with conventional digital CMOS ASICs and 10-1000x compared with traditional general-purpose processors.
Objective 2
The compiler makes use of the CIM architectures developed in WP2 and the applications identified in WP1. First, we built a micro-instruction compiler for the TTA-based CIM micro-architecture developed in WP2. The micro-compiler can offload user-annotated CIM micro/nano-operations to CIM units integrated into CGRA/TTA architectures, taking into account the computation and communication resources of the architecture and the parallelism available in the user program. Second, we designed a macro-programming interface, following a top-down approach, to support the programming of multiple CIM units at system level. Third, we created a full end-to-end compilation flow from the high-level representation to code that executes on the macro CIM architecture proposed in WP3, and demonstrated the flow on the full-system simulators developed in WP5 with the CIM applications identified in WP1.
Objective 3
First, an initial macro CIM architecture was designed. Based on the CIM tile (micro-architecture level) developed in WP4, a multi-tile CIM accelerator (i.e. the macro architecture) and its instruction set architecture (ISA) were developed. This CIM accelerator was integrated as a functional unit into TTA, a datapath-aware architecture. Second, the CIM core (based on CGRC) was integrated into a computing cluster, for which we used the multi-core PULP system developed by ETHZ. An efficient communication fabric for tight integration of the CIM core as an in-memory accelerator was developed as part of this work. Third, the key kernels derived from the applications selected in WP1 were mapped onto the CIM macro architecture. Work concentrated on Hyper-Dimensional (HD) computing (see D1.2), which makes intensive use of simple instructions on ultra-wide words (e.g. bitwise XOR). HD computing is essentially about manipulating and comparing large patterns (e.g. hypervectors of 10,000 bits) stored in memory, which plays to the strengths of CIM accelerator tiles. Our results indicate up to 60x improvement in peak throughput and 40x improvement in energy efficiency for GEMM-type workloads.
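As an illustration of the kind of HD computing kernel referred to above, the following is a minimal software sketch, not the project's implementation. It assumes binary hypervectors of 10,000 bits stored as Python integers, XOR as the binding operation, and Hamming distance as the comparison metric; the function names, the three stored labels and the 1,000-bit noise level are hypothetical choices made only for this example.

```python
import random

DIM = 10_000  # hypervector width in bits (the example size quoted in the text)


def random_hypervector(rng, dim=DIM):
    """Return a random dim-bit hypervector stored as a Python int."""
    return rng.getrandbits(dim)


def bind(a, b):
    """Bind two hypervectors with bitwise XOR (XOR is its own inverse)."""
    return a ^ b


def hamming_distance(a, b):
    """Count the bit positions in which two hypervectors differ."""
    return bin(a ^ b).count("1")


def nearest(query, memory):
    """Return the label of the stored hypervector closest to the query."""
    return min(memory, key=lambda label: hamming_distance(query, memory[label]))


rng = random.Random(0)
key = random_hypervector(rng)
memory = {label: random_hypervector(rng) for label in ("a", "b", "c")}

# Bind a stored pattern to a key, then unbind it with the same key.
bound = bind(key, memory["b"])
recovered = bind(bound, key)

# Flip 1,000 distinct bits to emulate a noisy query (assumed noise level).
noisy = recovered
for position in rng.sample(range(DIM), 1000):
    noisy ^= 1 << position

best_match = nearest(noisy, memory)
```

Both operations touch every bit of every stored pattern while doing very little arithmetic per bit, which is why such kernels are dominated by memory traffic on a conventional processor.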
Objective 4
First, small crossbars were characterised and models were developed. Second, memristor-based primitive logic and arithmetic circuits were designed and validated using SPICE simulation. Third, a complete first version of the CIM tile micro-architecture was developed and designed (using C code), including its instruction set architecture (ISA) and a compiler able to translate the macro-instructions defined in WP3 into nano-instructions that can be understood by the CIM tile. Fourth, at the lowest hardware level, the impact of the memristor array architecture and ADC design choices on the performance of MAC operations was investigated for ReRAM-based memristor devices. Finally, CIM tiles based on an STT-RAM memristor array were optimized for performing binary logic operations as well as Matrix-Matrix Multiplication (MMM) operations.
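To make the interaction between crossbar MAC operations and ADC resolution more concrete, the following is a minimal idealised sketch. It is not the model used in the project: it assumes a crossbar in which each column current is the sum of input voltage times cell conductance, a uniform ADC with a fixed full-scale range, and no device non-idealities (wire resistance, variability, noise). All names and numeric values are hypothetical and in arbitrary units.

```python
def crossbar_mac(conductances, voltages):
    """Ideal analogue MAC: column current j = sum over rows i of V[i] * G[i][j]."""
    columns = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(columns)
    ]


def adc_quantize(value, full_scale, bits):
    """Uniform ADC model: clip to [0, full_scale], round to one of 2**bits levels."""
    steps = (1 << bits) - 1
    clipped = min(max(value, 0.0), full_scale)
    code = round(clipped / full_scale * steps)
    return code * full_scale / steps


# Assumed 3x2 conductance matrix and input voltages (arbitrary units).
G = [[1.0, 0.5],
     [0.25, 1.0],
     [0.5, 0.75]]
V = [1.0, 0.5, 0.25]
FULL_SCALE = 3.0  # assumed largest column current the ADC must represent

exact = crossbar_mac(G, V)
read_4bit = [adc_quantize(i, FULL_SCALE, 4) for i in exact]
read_8bit = [adc_quantize(i, FULL_SCALE, 8) for i in exact]

# Worst-case read-out error per column for each ADC resolution.
error_4bit = max(abs(q - e) for q, e in zip(read_4bit, exact))
error_8bit = max(abs(q - e) for q, e in zip(read_8bit, exact))
```

In this idealised model the read-out error is bounded by half an ADC step, so each extra bit of resolution halves the bound; the cost of that resolution in area and energy is the kind of trade-off that ADC design choices have to balance.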
Objective 5
The framework supporting the characterization and study of the presented CIM architectures was delivered as the refined versions of the simulator (D5.6) and emulator (D5.8) packages. These key milestones reflect the quality of the knowledge generated in all the technical work packages. They bring together, in a coherent simulation/emulation platform, an end-to-end approach for benchmarking the applications studied in WP1, which were accelerated on the architecture proposed in WP3 using the compilers provided by WP2. The models of the CIM submodules were calibrated against the circuits engineered in WP4.