Evaluation and benchmarking of various applications and computational methods for potential acceleration with VEC and MLS chips in the EUPILOT platform has progressed. Two HPC applications have been characterized and are being ported to VEC: GROMACS and a set of dwarfs derived from EC-EARTH. For AI, a video surveillance model built on top of YOLO and two language models (BERT and GPT-2) have been characterized and evaluated in terms of acceleration capabilities. A new workflow integrating HPC and AI is being designed that combines molecular dynamics and language models, targeting VEC and MLS, respectively.
Regarding numerical libraries, the team has been working on the implementation of BLIS and FFTW kernels for both VEC and MLS, to support the HPC applications. To support the AI frameworks and models, the team has successfully implemented a version of oneDNN targeting the VEC accelerator. For MLS, support for converting ONNX models to DaCe and an initial version of the MLS backend have been released. The team has been developing a memory management solution and a version of TensorFlow that dynamically links with oneDNN, to feed Arax. Arax has been ported to a RISC-V QEMU environment. The team has provided a oneDNN library optimised for VEC to assist its integration into TensorFlow. Work has started on integrating Tarantella with DaCe/TensorFlow.
Co-design work has started on developing tests to verify the functionality of OpenMPI's data-transfer engine (DTE). The team has worked towards the final goal of porting and optimising the OpenMP runtime for VEC, with a focus on locality awareness and better energy efficiency. Similarly for MPI, a new OpenMPI component has been developed to optimize collective operations by making use of the hardware extensions in VEC. Effort has been devoted to developing an optimized version of the TAMPI (Task-Aware MPI) library that manages all concurrent MPI requests internally, as well as a port of the DLB (Dynamic Load Balancing) library to RISC-V.
Work was done on node- and cluster-level resource management, based on the integration of three components: SLURM, Konro and DROM. The malleability work is completed with the deployment of DMRLib on RISC-V. The team has worked on porting a recent Linux kernel and root file system with the appropriate customizations (device drivers), on an environment to ease image file generation and deployment, and on a new Fast Context Switch module to better support OpenMP free-agent threads and DLB.
In terms of tools, the team has been working on integrating the Fortran front-end of LLVM with the EPI compiler to pave the way to vectorisation and optimized code generation for both VEC and MLS. An initial release of the memory interference analysis engine, supporting the analysis of scalar and vector memory instructions, has been implemented in LLVM as a RISC-V back-end pass.
The hardware team focused on two main areas in parallel.
The tapeout of the so-called test-chip has been completed, and the chip will arrive in June 24. The characterization and debugging of all critical structures to be used in future chips will take around three months.
Work was performed on the implementation of parts of the uncore, including the C2C controller, the LPDDR controller, and the CXL controller (with their corresponding PHYs), in addition to the power management controller, PLLs, etc. Most of this work was included in the test-chip.
In parallel, work was done to reach freeze milestones in VEC and MLS designs.
In terms of the memory hierarchy, design work was performed on cache improvements and feature upgrades in the intra-chip coherency mechanisms. In the RISC-V/VEC cores, performance increases can be expected from a 4x increase in the number of outstanding misses that can be handled. Work was also performed to extend the AMBA5 CHI. The first interface specification for the I/O coherent data-transfer engine (DTE) was created, along with the DMA engine.
Work was performed to improve the VEC core from a 2-way in-order design to a 3-way out-of-order core. The interface between the core and the VPU (OVI) was upgraded to version 2, with changes in both the core and the VPU. There are also improvements to the virtual channels in the NoC, enabling inter-chip routing.
On the MLS side, improvements have been made to the integration of the SPU with the Snitch integer core and to the memory mapping of the SPU, along with further integration improvements.
Verification efforts have been devoted to transitioning from version 0.7 to version 1.0 of the V extension. Work continued on the multi-FPGA environments with C2C protocol extensions.
In the systems area, the development of the testboard that will host the test-chip was completed, and work on the system specifications and the definition of the requirements has continued.
Finally, planning for the deployment and operation of liquid immersion cooling tanks continued.