Outplaying the hardware lottery for embedded AI

Project information

BINGO

Grant agreement ID: 101088865

DOI

10.3030/101088865

EC signature date: 2 March 2023

Start date: 1 June 2023

End date: 31 May 2028

Funded under

European Research Council (ERC)

Total cost

€ 1 995 750,00

EU contribution

€ 1 995 750,00


Coordinated by

KATHOLIEKE UNIVERSITEIT LEUVEN
Belgium

Periodic Reporting for period 1 - BINGO (Outplaying the hardware lottery for embedded AI)

Reporting period: 2023-06-01 to 2025-11-30

The next wave of smart applications in our society will need embedded devices (robots, wearables, etc.) with increased intelligence at much lower energy and latency cost. Compared to current embedded platforms, efficiency gains of up to 1000x could be achieved through tight processor-algorithm co-optimization. However, because processor chips have a much slower development cycle (many months to years) than algorithms (hours to weeks), this co-optimization today boils down to selecting algorithms that run well on mature, available hardware. As these processors and their tooling have been optimized for mature algorithms, it is not the inherently best algorithm that “wins”, but the one that happens to best fit the available “old-school” hardware platforms. This “hardware lottery” holds back innovation, severely impacts embedded AI execution efficiency, and narrows the market to a few large companies.

The BINGO vision to break this innovation deadlock is to enable heterogeneous compute platform customization for a given AI workload in a matter of days (100x faster), through rapid selection and assembly of prefabricated co-processor chiplets. This requires breakthroughs in: a.) a library of embedded-AI-optimized co-processor chiplets that surpasses the state of the art (SotA) both in dataflow heterogeneity for improved efficiency (100x over CPU) and in interoperability within heterogeneous chiplet meshes on a reusable “breadboard” interposer; and b.) rapid cost models and workload schedulers for beyond-SotA heterogeneous platform customization, which automatically derive the optimal chiplet combination for an application so that it can be assembled and deployed, all in a few days.
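To make the idea of deriving a chiplet combination from a cost model concrete, the following is a minimal sketch and not the BINGO tooling. It assumes a hypothetical three-entry chiplet library (`CHIPLETS`), made-up energy-per-operation figures, an assumed CPU fallback cost (`CPU_ENERGY_PJ`), and a workload described only by operation counts per operation type. It exhaustively evaluates every subset that fits in a limited number of interposer slots and returns the lowest-energy one.

```python
from itertools import combinations

# Hypothetical chiplet library: energy per operation (pJ) per supported op type.
# Names and numbers are illustrative assumptions, not BINGO measurements.
CHIPLETS = {
    "gemm": {"matmul": 1.0},
    "sparse": {"matmul": 2.0, "sparse_matmul": 0.5},
    "simd": {"elementwise": 1.5, "matmul": 4.0},
}
CPU_ENERGY_PJ = 100.0  # assumed fallback cost for any op type on the host CPU


def workload_energy(selection, workload):
    """Total energy (pJ) when each op type runs on its cheapest available engine."""
    total = 0.0
    for op_type, count in workload.items():
        costs = [CHIPLETS[c][op_type] for c in selection if op_type in CHIPLETS[c]]
        # The host CPU is always available as a (costly) fallback.
        total += count * min(costs + [CPU_ENERGY_PJ])
    return total


def best_combination(workload, max_chiplets):
    """Exhaustively search all chiplet subsets with at most max_chiplets entries."""
    candidates = []
    for k in range(max_chiplets + 1):
        for selection in combinations(sorted(CHIPLETS), k):
            candidates.append((workload_energy(selection, workload), selection))
    return min(candidates)


# Example workload: operation counts per op type (hypothetical).
workload = {"matmul": 1000, "sparse_matmul": 1000, "elementwise": 100}
print(best_combination(workload, max_chiplets=2))  # (2650.0, ('simd', 'sparse'))
```

Exhaustive search is only workable for a small library. A realistic flow would need a far richer cost model (latency, memory, inter-chiplet traffic) and a smarter search strategy.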

The BINGO project has completed 30 months (2.5 years) of its five-year duration and is well into its first implementation phase, with significant progress on the core R&D objectives across all five Work Packages (WPs).

In WP1, we developed a unified abstraction framework for AI workloads and hardware accelerator descriptors. This model — now actively used in both our scheduler (Stream) and compiler backend (SNAX-MLIR) — captures memory-compute structures, fusion patterns, and loop dependencies, and has proven essential for hardware-software co-design. WP1 also yielded the SNAX framework, a reusable accelerator shell architecture combining lightweight RISC-V control, tightly coupled memory, and programmable data streamers. SNAX enables modular integration of custom accelerators into shared-memory systems and is now open source, with strong early adoption. The associated DataMaestro abstraction and SNAX-MLIR compiler have enabled rapid co-development across the stack.

In WP2, we developed a suite of specialized AI accelerators aligned with SNAX. Highlights include the OpenGemm matrix engine (ASP-DAC 2025), BitWave (HPCA 2024) for bit-level sparse inference, and the ViT-Edge accelerator (ISCAS 2024) for compact transformer workloads. Three additional unpublished accelerators — targeting compute-in-memory, hyperdimensional computing, and CGRA-based execution — have also been completed and integrated into our flow.

In WP3, the chiplet-to-chiplet communication protocol was finalized, and the digital design of drivers and receivers completed. The interposer technology has been selected. We also acquired bonding infrastructure, completing our lab setup for future chiplet-interposer integration. Building on this, we have defined the near-term integration roadmap: an interposer tape-out is planned for February 2026, followed by the first accelerator chiplet tape-out in May 2026.

In WP4, we developed the ZigZag/Stream scheduling framework (published in IEEE Transactions on Computers), which enables analytical scheduling of layer-fused DNNs on multi-core accelerators. Stream is now open source, has over 150 GitHub stars, and is used by multiple external groups. It balances memory reuse, inter-core bandwidth, and latency. Additional efforts include COAC (ISQED 2024) and CMDS (VLSI Transactions) for joint hardware and memory-aware scheduling. Compiler development began with TVM (MATCH, TCAD) but transitioned to a more flexible MLIR backend. SNAX-MLIR now supports pipelined asynchronous scheduling and register-level kernel generation.
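The trade-off between compute speed and inter-core bandwidth that an analytical scheduler has to weigh can be illustrated with a toy model. This sketch is not Stream's algorithm or API. It assumes a purely sequential chain of layers, hypothetical per-core throughputs (`core_speed`, in operations per time unit), and a single shared link bandwidth; all names and numbers are made up for illustration.

```python
from itertools import product


def schedule_latency(assignment, layers, core_speed, link_bandwidth):
    """Analytical latency of a sequential layer chain for one layer-to-core mapping."""
    latency = 0.0
    for i, (layer, core) in enumerate(zip(layers, assignment)):
        # Compute time of this layer on its assigned core.
        latency += layer["ops"] / core_speed[core][layer["kind"]]
        # Pay an inter-core transfer when the next layer runs on a different core.
        if i + 1 < len(layers) and assignment[i + 1] != core:
            latency += layer["out_bytes"] / link_bandwidth
    return latency


def best_assignment(layers, core_speed, link_bandwidth):
    """Evaluate every layer-to-core mapping and return (latency, mapping)."""
    cores = sorted(core_speed)
    return min(
        (schedule_latency(a, layers, core_speed, link_bandwidth), a)
        for a in product(cores, repeat=len(layers))
    )


# Hypothetical three-layer network and two heterogeneous cores.
layers = [
    {"kind": "conv", "ops": 800, "out_bytes": 400},
    {"kind": "pool", "ops": 100, "out_bytes": 100},
    {"kind": "conv", "ops": 800, "out_bytes": 100},
]
core_speed = {
    "core0": {"conv": 8, "pool": 1},  # fast at convolutions
    "core1": {"conv": 2, "pool": 4},  # fast at pooling
}

print(best_assignment(layers, core_speed, link_bandwidth=10))
# (275.0, ('core0', 'core1', 'core0'))
print(best_assignment(layers, core_speed, link_bandwidth=1))
# (300.0, ('core0', 'core0', 'core0'))
```

With a fast link, moving the pooling layer to the core that suits it pays off. With a slow link, the transfer cost dominates and keeping every layer on one core gives the lower latency.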

In WP5, we completed the 2025 tape-out of the multi-accelerator SoC, which integrates 4 AI accelerators and an RV64 (CVA6) RISC-V host core. The SoC demonstrates the benefits of heterogeneous compute clusters and validates the end-to-end toolchain from abstraction and compilation to silicon. The SoC work is currently under review for publication.

Together, these results mark a major step toward our overall goal: delivering a functioning, open, and extensible vertical stack to break the hardware lottery in embedded AI design.

BINGO has already delivered several key results that go beyond the current state of the art in embedded AI hardware and systems design, and these have already seen some academic and industrial uptake.
At the heart of these results is the SNAX framework, which provides an architectural template for composing heterogeneous accelerators in a tightly coupled, pipelined system. SNAX’s hybrid control/data coupling model — combining asynchronous software-driven task orchestration with tightly coupled shared memory — enables efficient execution of AI workloads without the overhead and rigidity of traditional SoC integration. This is a notable departure from prior accelerator integration paradigms, offering a new template for reusability and modularity in silicon.
Equally important is the Stream framework, which provides a novel methodology for analytical scheduling of fused neural networks under resource constraints. Stream enables transparent trade-off analysis of memory, bandwidth, latency, and compute utilization across a diverse range of architectures, and can be used both for hardware-software co-design and for deployment-level scheduling. This fills a methodological gap in current AI toolchains, which tend to rely on opaque autotuning or fixed-pattern schedulers. The many users of Stream, including industrial players, underline its importance.

Crucially, BINGO has moved beyond “paper architectures” by validating the vertical stack on silicon via the 2025 SoC tape-out, demonstrating that the project’s co-design methodology can translate into a heterogeneous multi-accelerator SoC. This de-risks the upcoming chiplet-based phase of the project.

As discussed, these outcomes are already gaining traction. To further maximize impact and ensure sustainability beyond the project lifetime, several key needs and actions have been identified:
1.) Further demonstration and benchmarking: With the SoC taped out, we will complete silicon bring-up and benchmarking across full-stack scenarios, including heterogeneous scheduling and fused workload execution.
2.) Community and ecosystem growth: Sustained support for the SNAX and Stream frameworks will be needed to continue attracting external contributors and users. We therefore invest in documentation, tutorials, reference flows, and long-term hosting.
3.) Pathways to commercialization: The modularity of SNAX makes it well-suited for technology transfer and IP licensing. Projects with industry stakeholders are ongoing. In parallel, the planned interposer (Feb 2026) and accelerator chiplet (May 2026) tape-outs will enable a scalable chiplet-based demonstration that supports broader uptake by reducing integration barriers for heterogeneous accelerator deployments.

ERC conceptual goals
