
Distributed Inference for Energy-efficient Monitoring at the Network Edge

Periodic Reporting for period 1 - DIME (Distributed Inference for Energy-efficient Monitoring at the Network Edge)

Reporting period: 2022-06-01 to 2024-05-31

Deep Learning (DL) inference on data from end devices such as IoT sensors, smartphones, and drones enhances operational efficiency and functionality in sectors like industrial automation, smart cities, remote healthcare, and smart agriculture. Performing DL inference locally on the devices, also known as edge ML inference, promises to improve energy efficiency and responsiveness and to alleviate privacy concerns. Nevertheless, the research emphasis has been on designing embedded DL models with higher accuracy, without systematically studying their latency and energy consumption on the devices. Consequently, existing distributed DL inference techniques do not achieve the important trade-offs between accuracy, latency, and energy consumption. In this context, the DIME project has made significant contributions toward building efficient edge ML systems.

In order to understand the trade-offs between accuracy, latency, and on-device energy consumption per inference, in this project we conducted a comprehensive measurement study on multiple devices that span different processor types, including CPUs, GPUs, and TPUs, using datasets of varying complexity. Further, to assess the performance of distributed DL inference between a device and a server, we also measured the cost of offloading data samples over different communication protocols (WiFi and Bluetooth) and the runtimes of state-of-the-art DL models on two distinct servers, one with an NVIDIA Tesla GPU and the other with an A100 GPU.
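
To illustrate the on-device measurement methodology, the following is a minimal sketch of how per-inference latency can be timed for a TensorFlow Lite model on a single-board computer. The model path and the zero-valued placeholder input are illustrative; the actual study used real test-set samples, and energy was measured with external power-monitoring hardware, which is not shown here.

```python
# Minimal latency-timing sketch for a TFLite model on a device such as a Raspberry Pi.
# "model.tflite" is a hypothetical path; the DIME measurements used real test samples
# and external power monitors for energy, neither of which is shown here.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter  # on a desktop: tf.lite.Interpreter

interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

latencies_ms = []
for _ in range(100):
    x = np.zeros(inp["shape"], dtype=inp["dtype"])   # placeholder input sample
    start = time.perf_counter()
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()                              # run one inference
    _ = interpreter.get_tensor(out["index"])
    latencies_ms.append(1e3 * (time.perf_counter() - start))

print(f"median per-inference latency: {np.median(latencies_ms):.2f} ms")
```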

We also studied the algorithmic problem of distributing inference tasks between devices and edge servers. We proposed a new inference load balancing algorithm that maximizes inference accuracy while satisfying a delay constraint of the application. Further, we proposed a new distributed DL inference framework called Hierarchical Inference (HI). Using this framework, we made significant progress in improving the efficiency of DL inference systems, and we evaluated the performance of the HI systems using the measurements. The algorithmic strategies proposed in DIME enable reliable and energy-efficient inference on devices by augmenting their capabilities with large DL models in the cloud, paving the way for large-scale adoption of edge AI systems with significant societal and economic benefits.
1) The primary objective of ML research so far has been to build models that provide the highest inference accuracy on the test dataset, while specialized hardware such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) has been used to accelerate the training and inference of large ML models. Energy and latency per inference have received relatively little attention. This is especially true for embedded ML models, which have been designed in recent years to run on resource-constrained microcontroller units (MCUs). In DIME, we contributed to filling this gap by performing comprehensive measurements of accuracy, latency, and energy for running embedded Deep Learning models on five IoT devices, ranging from highly resource-constrained MCUs to a moderately powerful single-board computer with a GPU. In particular, we studied two widely used commodity MCUs (Arduino Nano 33 BLE Sense and ESP32), one ML-specialized MCU with a TPU (Coral Dev Board Micro), and two popular single-board computers (a Raspberry Pi with a CPU and a Jetson Orin Nano with a GPU). These measurements are reported in [Behera, 2024].

Early Exit with HI (EE-HI): Lower-latency on-device models are necessary for an HI implementation. Toward this end, in [Behera, 2024], we proposed the EE-HI system by combining HI with the early-exit technique [Teerapittayanon, 2016], which reduces the time per inference by introducing exit branches between layers of the DL model. Specifically, we modify the base models by adding early-exit branches to create new local ML models. This allows inferences to be accepted early when a data sample is likely to be a simple one. After a data sample is processed by the early-exit local ML model, HI is applied on top of it: we train a logistic regression module that decides whether the local inference (at the final layer) suffices for the sample or whether it needs to be offloaded to the remote server for further inference. The proposed architecture of early exit with HI for Convolutional Neural Networks (CNNs) is shown in the figure below; a minimal code sketch follows the reference entry below.

[Behera, 2024] Adarsh Prasad Behera, Paulius Daubaris, Iñaki Bravo, José Gallego, Roberto Morabito, Joerg Widmer, Jaya Prakash Varma Champati, “Exploring the Boundaries of On-Device Inference: When Tiny Falls Short, Go Hierarchical,” under submission, ACM Conference on Embedded Networked Sensor Systems (SenSys), 2024.
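
As a concrete illustration of the EE-HI idea described above, the following is a minimal PyTorch sketch of a small CNN with a single early-exit branch plus a logistic-regression HI decision module. The layer sizes, input resolution (32x32 RGB), exit-confidence threshold, and decision features are illustrative assumptions and are not the exact design of [Behera, 2024].

```python
# Illustrative EE-HI sketch (not the exact architecture of [Behera, 2024]):
# a small CNN with one early-exit branch plus a logistic-regression HI decision module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitCNN(nn.Module):
    def __init__(self, num_classes=10, exit_threshold=0.9):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.exit1 = nn.Linear(16 * 16 * 16, num_classes)   # early-exit branch after block1
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.final = nn.Linear(32 * 8 * 8, num_classes)     # final classifier
        self.exit_threshold = exit_threshold                 # illustrative value

    def forward(self, x):  # x: (1, 3, 32, 32), one sample at a time as on the device
        h1 = self.block1(x)
        p1 = F.softmax(self.exit1(h1.flatten(1)), dim=1)
        if p1.max().item() >= self.exit_threshold:           # "simple" sample: exit early
            return p1, True
        h2 = self.block2(h1)
        return F.softmax(self.final(h2.flatten(1)), dim=1), False

# HI decision module: logistic regression over simple confidence features decides whether
# the local (final-layer) inference suffices or the sample is offloaded to the server.
hi_logreg = nn.Linear(2, 1)  # trained offline with a binary "local inference correct?" label

def hi_offload(probs):
    top2 = torch.topk(probs, 2, dim=1).values
    feats = torch.cat([top2[:, :1], top2[:, :1] - top2[:, 1:2]], dim=1)  # top-1 conf, margin
    return torch.sigmoid(hi_logreg(feats)).item() < 0.5  # True -> offload to remote model
```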

2) DIME contributed new algorithmic solutions answering the question of where to schedule and perform the computation for ML inference (inference tasks) for applications from different end devices: locally on the device, on an edge server (near ML), or in the cloud (far ML). Key aspects I considered in answering this question include a) the reliability and latency QoS requirements of the application and b) the energy consumption at the end devices and edge servers. Inference Scheduling: We studied the problem of partitioning the set of inference tasks between a tinyML model on a device and an edge server to maximize inference accuracy under a delay constraint. We designed novel approximation algorithms, implemented them on a Raspberry Pi, and showed that they are asymptotically optimal [Fresa, 2023]; an illustrative sketch of the partitioning idea follows the references below. Further, to address efficient inference task scheduling on edge servers, we built an edge inference-serving system, InferEdge [Fresa, 2025], which optimizes both model selection and resource allocation. A novel aspect of InferEdge is that it provides a weight parameter to tune the trade-off between inference accuracy and average service rate.

[Fresa, 2023] Andrea Fresa, Jaya Prakash Champati, "Offloading Algorithms for Maximizing Inference Accuracy on Edge Device in an Edge Intelligence System," in IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 34, no. 7, pp. 2025-2039, July 2023.
[Fresa, 2025] Andrea Fresa, Claudio Fiandrino, Joerg Widmer, Jaya Prakash Champati, "InferEdge: Online Model Selection and Resource Allocation for Efficient Inference at the Edge," planned submission, IEEE INFOCOM, 2025.
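
To make the inference-scheduling idea concrete, below is an illustrative greedy sketch of partitioning inference tasks between an on-device tinyML model and an edge server under a total delay budget. It is not the approximation algorithm of [Fresa, 2023]; the per-task accuracy and delay estimates, and the numbers in the example, are assumptions used only to convey the accuracy-delay trade-off.

```python
# Illustrative greedy partitioning of inference tasks between a tinyML model on the
# device and a larger model on the edge server, under a total delay budget (seconds).
# This is NOT the approximation algorithm of [Fresa, 2023]; it only conveys the trade-off.
def partition_tasks(tasks, delay_budget):
    """tasks: list of dicts {"id", "acc_local", "acc_remote", "d_local", "d_remote"}.
    Start with everything executed locally, then greedily offload the tasks with the
    largest accuracy gain per unit of extra delay while the budget allows it."""
    total_delay = sum(t["d_local"] for t in tasks)
    order = sorted(
        tasks,
        key=lambda t: (t["acc_remote"] - t["acc_local"]) / max(t["d_remote"] - t["d_local"], 1e-9),
        reverse=True,
    )
    offload = set()
    for t in order:
        extra = t["d_remote"] - t["d_local"]   # additional delay from offloading this task
        if t["acc_remote"] > t["acc_local"] and total_delay + extra <= delay_budget:
            offload.add(t["id"])
            total_delay += extra
    return offload, total_delay

# Example with made-up numbers: two tasks, delay budget of 0.3 s.
tasks = [
    {"id": 0, "acc_local": 0.80, "acc_remote": 0.95, "d_local": 0.02, "d_remote": 0.25},
    {"id": 1, "acc_local": 0.85, "acc_remote": 0.95, "d_local": 0.02, "d_remote": 0.25},
]
print(partition_tasks(tasks, delay_budget=0.3))  # only task 0 (larger accuracy gain) fits the budget
```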

3) DIME contributed to efficient distributed DL inference by proposing Hierarchical Inference (HI). In our work [Al-Atat, 2023], we showed the feasibility of HI for image classification applications. We used a simple threshold on the confidence of the tinyML model for HI decisions, i.e. to categorize whether the classification output by the tinyML model is likely correct or incorrect. Edge Impulse, a leading tinyML company, featured a post about [Al-Atat, 2023], citing HI as a promising distributed inference technique. In [Behera, 2023], we further explored linear regression and model calibration [Guo, 2017] techniques to improve HI decision making beyond the simple threshold rule. Furthermore, noting that the ground truth about an accepted inference is not known to the end device, in [Moothedath, 2024] we studied an online learning problem that falls in the domain of Prediction with Expert Advice (PEA) with the added challenge of continuous experts. In ongoing work [Dutta, 2024], we transformed the problem into PEA with a growing number of experts. For a natural extension of the Hedge algorithm, we proved an O(√(T ln N_T)) regret bound, which improves over the existing O(√(T N_T)) bound. Further, as the devices have limited computing capability, we developed an efficient online learning algorithm that has a smaller runtime and an O(√(T ln N_T)) regret [Al-Atat, 2024]. A minimal sketch of the threshold-based HI decision rule and a Hedge-style threshold update follows the references below.

[Al-Atat, 2023] Ghina Al-Atat, Andrea Fresa, Adarsh P. Behera, Vishnu N. Moothedath, James Gross, Jaya Prakash Champati, “The Case for Hierarchical Deep Learning Inference at the Network Edge”, in Proc. NetAI workshop, ACM Mobisys, 2023.

[Behera, 2023] Adarsh P. Behera, Roberto Morabito, Joerg Widmer, Jaya Prakash Champati, “Improved Decision Module Selection for Hierarchical Inference in Resource-Constrained Edge Devices”, in Proc. ACM MobiCom (short paper), 2023.

[Moothedath, 2024] Vishnu N. Moothedath, Jaya Prakash Champati, James Gross, "Getting the Best Out of Both Worlds: Algorithms for Hierarchical Inference at the Edge," in IEEE Transactions on Machine Learning in Communications and Networking, vol. 2, pp. 280-297, 2024.

[Dutta, 2024] Puranjay Dutta, Jaya Prakash Champati, Sharayu Moharir, "Improved Regret Bounds for Growing Number of Experts," planned submission to Association for the Advancement of Artificial Intelligence (AAAI), 2025.

[Al-Atat, 2024] Ghina Al-Atat, Puranjay Datta, Sharayu Moharir, Jaya Prakash Champati, "Regret Bounds for Online Learning for Hierarchical Inference," under submission, ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), 2024.
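
To make the HI decision rule and the online-learning direction concrete, below is a minimal sketch of (i) the confidence-threshold rule used in [Al-Atat, 2023] and (ii) a standard Hedge-style multiplicative-weights update over a discretized set of candidate thresholds. The full-information feedback assumed here is a deliberate simplification: in HI, accepted local inferences reveal no ground truth, which is precisely the challenge addressed in [Moothedath, 2024], [Dutta, 2024], and [Al-Atat, 2024].

```python
# Minimal HI decision and online-threshold-learning sketch; assumes simplified
# full-information feedback, unlike the partial-feedback setting studied in DIME.
import numpy as np

def hi_decision(tinyml_probs, threshold):
    """Accept the local tinyML inference if its top-class confidence clears the
    threshold; otherwise offload the sample to the large remote model."""
    return "accept" if float(np.max(tinyml_probs)) >= threshold else "offload"

class HedgeOverThresholds:
    """Hedge over N candidate thresholds ("experts"). With full-information losses
    in [0, 1], its regret against the best fixed threshold is O(sqrt(T ln N))."""
    def __init__(self, n_experts=20, horizon=1000, rng=None):
        self.thresholds = np.linspace(0.0, 1.0, n_experts)
        self.weights = np.ones(n_experts)
        self.eta = np.sqrt(8.0 * np.log(n_experts) / horizon)  # standard Hedge rate
        self.rng = rng or np.random.default_rng()

    def pick_threshold(self):
        p = self.weights / self.weights.sum()
        return self.rng.choice(self.thresholds, p=p)

    def update(self, per_expert_losses):
        """per_expert_losses: loss in [0, 1] that each candidate threshold would have
        incurred this round (e.g. misclassification cost plus offloading cost)."""
        self.weights *= np.exp(-self.eta * np.asarray(per_expert_losses))
```
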
The results of the DIME project significantly contributed to performing efficient DL inference at the edge. Existing techniques for DL inference at the edge can be broadly classified into: 1) on-device inference, where all inferences are performed locally; 2) DNN partitioning, where a DL model is split between the device and an edge server; and 3) remote inference, where all data samples are offloaded and the inferences are performed on an edge server or in the cloud. In this project, I advanced this line of research by proposing novel inference load balancing algorithms and designing the HI systems. Notably, HI has been receiving considerable attention from other research groups working in the field of edge AI. The paper [Beytur, 2024], recently published in the top-tier conference IEEE INFOCOM, extends the HI ideas that I proposed in the DIME project.

[Beytur, 2024] Hasan Burhan Beytur, Ahmet Gunhan Aydin, Gustavo de Veciana, Haris Vikalo. “Optimization of Offloading Policies for Accuracy-Delay Tradeoffs in Hierarchical Inference”, in Proc. IEEE INFOCOM, 2024.

Further, the measurement study I conducted is unique and timely. The measurements not only served as a basis for evaluating different DL inference strategies on devices, but are also of independent interest to the embedded ML research community. We made them freely available on GitHub. I also plan to submit the measurements as a tinyML benchmark at MLCommons, a non-profit organization that aims to accelerate machine learning innovation. This will further increase DIME's positive impact on society through its use by industry as well.
Hierarchical Inference framework for distributed DL inference
Hierarchical Inference with Early Exit technique in CNN