Periodic Reporting for period 1 - DIME (Distributed Inference for Energy-efficient Monitoring at the Network Edge)
Reporting period: 2022-06-01 to 2024-05-31
To understand the trade-offs between accuracy, latency, and on-device energy consumption per inference, in this project we conducted a comprehensive measurement study on multiple devices spanning different processor types, including CPU, GPU, and TPU, using datasets of varying complexity. Further, to assess the performance of distributed DL inference between a device and a server, we measured the cost of offloading data samples over different communication protocols, namely WiFi and Bluetooth, and measured the runtimes of state-of-the-art DL models on two distinct servers, one with an NVIDIA Tesla GPU and the other with an NVIDIA A100 GPU.
We also studied the algorithmic problem of distributing inference tasks between devices and edge servers. We proposed a new inference load balancing algorithm that maximizes inference accuracy while satisfying the application's delay constraint. Further, we proposed a new distributed DL inference framework called Hierarchical Inference (HI). Using this framework, we did significant work on improving the efficiency of DL inference systems, and we evaluated the performance of HI systems using the measurements. The algorithmic strategies proposed in DIME enable reliable and energy-efficient inference on devices by augmenting their capabilities with large DL models in the cloud, paving the way for large-scale adoption of edge AI systems with significant societal and economic benefits.
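As an illustration of how the per-inference latency numbers in the measurement study described above can be collected on a device, the following is a minimal sketch using the tflite-runtime Python package. The model file name, the random stand-in inputs, and the sample count are hypothetical and are not taken from the DIME measurement setup.

    import time
    import numpy as np
    import tflite_runtime.interpreter as tflite

    # Load a tinyML model on the device (hypothetical model file).
    interpreter = tflite.Interpreter(model_path="mobilenet_v2.tflite")
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    latencies = []
    for _ in range(100):  # hypothetical number of samples
        # Stand-in for a real data sample with the model's input shape and dtype.
        x = (np.random.rand(*inp["shape"]) * 255).astype(inp["dtype"])
        interpreter.set_tensor(inp["index"], x)
        t0 = time.perf_counter()
        interpreter.invoke()  # one on-device inference
        latencies.append(time.perf_counter() - t0)
        _ = interpreter.get_tensor(out["index"])

    print("mean latency per inference: %.2f ms" % (1e3 * sum(latencies) / len(latencies)))

Measuring energy per inference additionally requires an external power meter or on-board power rails, which is not shown in this sketch.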
Early Exit with HI (EE-HI): Lower-latency on-device models are necessary for implementing HI. Toward this end, in [Behera, 2024] we proposed the EE-HI system by combining HI with the early exit technique [Teerapittayanon,2016], which reduces the time per inference by introducing exit branches between layers of the DL model. Specifically, we modify the base models by adding early exit branches to create new local ML models, which allows inferences to be accepted early when a data sample is likely to be a simple sample. After a data sample is processed by the early-exit local ML model, HI is applied on top of it: we train a logistic regression module that decides whether the local inference (at the final layer) suffices or the sample needs to be offloaded to the remote server for further inference. The proposed architecture of early exit with HI for Convolutional Neural Networks (CNNs) is shown in the figure below.
[Behera, 2024] Adarsh Prasad Behera, Paulius Daubaris, Iñaki Bravo, José Gallego, Roberto Morabito, Joerg Widmer, Jaya Prakash Varma Champati, “Exploring the Boundaries of On-Device Inference: When Tiny Falls Short, Go Hierarchical,” under submission, ACM Conference on Embedded Networked Sensor Systems (SenSys), 2024.
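To make the EE-HI decision flow described above concrete, below is a minimal PyTorch-style sketch of a CNN with one early-exit branch and a logistic-regression gate for the HI offloading decision. The layer sizes, the exit confidence threshold, and the gate input are illustrative assumptions and do not reproduce the architecture in [Behera, 2024].

    import torch
    import torch.nn as nn

    class EarlyExitCNN(nn.Module):
        def __init__(self, num_classes=10, exit_threshold=0.9):
            super().__init__()
            self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.exit1 = nn.Linear(16 * 16 * 16, num_classes)   # early-exit branch after block1
            self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.head = nn.Linear(32 * 8 * 8, num_classes)       # final (last-layer) classifier
            self.gate = nn.Linear(num_classes, 1)                 # logistic-regression HI decision module
            self.exit_threshold = exit_threshold

        def forward(self, x):
            h1 = self.block1(x)
            early_logits = self.exit1(h1.flatten(1))
            early_conf = early_logits.softmax(dim=1).max(dim=1).values
            if early_conf.item() >= self.exit_threshold:          # simple sample: accept early exit
                return early_logits, "accept-early"
            final_logits = self.head(self.block2(h1).flatten(1))
            offload_prob = torch.sigmoid(self.gate(final_logits.softmax(dim=1)))
            if offload_prob.item() > 0.5:                          # HI decision: local inference not trusted
                return final_logits, "offload-to-server"
            return final_logits, "accept-local"

    model = EarlyExitCNN()
    logits, decision = model(torch.randn(1, 3, 32, 32))            # 32x32 RGB input assumed
    print(decision)

In practice the gate would be trained offline, as described above, and batched processing would replace the per-sample .item() checks used here for readability.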
2) DIME contributed new algorithmic solutions answering the question: where should the computation for ML inference (inference tasks) of applications running on different end devices be scheduled and performed: locally on the device, on an edge server (near ML), or in the cloud (far ML)? Key aspects we considered in answering this question include a) the reliability and latency QoS requirements of the application, and b) the energy consumption at the end devices and edge servers. Inference Scheduling: We studied the problem of partitioning a set of inference tasks between a tinyML model on a device and an edge server to maximize inference accuracy under a delay constraint. We designed novel approximation algorithms, implemented them on a Raspberry Pi, and showed that they are asymptotically optimal [Fresa,2023]. Further, to address efficient inference task scheduling on edge servers, we built an edge inference-serving system, InferEdge [Fresa,2025], which jointly optimizes model selection and resource allocation. A novel aspect of InferEdge is a weight parameter that tunes the trade-off between inference accuracy and average service rate.
[Fresa,2023] Andrea Fresa, Jaya Prakash Champati, "Offloading Algorithms for Maximizing Inference Accuracy on Edge Device in an Edge Intelligence System," in IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 34, no. 7, pp. 2025-2039, July 2023.
[Fresa,2025] Andrea Fresa, Claudio Fiandrino, Joerg Widmer, Jaya Prakash Champati, “InferEdge: Online Model Selection and Resource Allocation for Efficient Inference at the Edge,” planned submission, IEEE INFOCOM, 2025.
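To illustrate the flavour of the task-partitioning problem studied in [Fresa,2023], the following is a minimal sketch of a greedy offloading rule: every sample is processed locally, and the samples the tinyML model is least confident about are offloaded to the edge server as long as the delay budget allows. This is not the approximation algorithm from the paper, and all timing parameters are hypothetical.

    def partition_tasks(confidences, t_local, t_offload, deadline):
        """Return the set of sample indices to offload.

        confidences : per-sample confidence of the on-device (tinyML) model
        t_local     : time to run one inference on the device
        t_offload   : time to transmit one sample and infer it on the edge server
        deadline    : total delay budget for the batch of samples
        """
        n = len(confidences)
        budget = deadline - n * t_local        # all samples are first processed locally
        offload = set()
        # Offload the least-confident samples while the remaining budget still
        # covers the extra offloading time per sample.
        for idx in sorted(range(n), key=lambda i: confidences[i]):
            if budget >= t_offload:
                offload.add(idx)
                budget -= t_offload
        return offload

    # Example: 6 samples, 20 ms local inference, 50 ms offload, 250 ms deadline.
    print(partition_tasks([0.95, 0.42, 0.88, 0.51, 0.99, 0.63], 0.020, 0.050, 0.250))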
3) DIME contributed to efficient distributed DL inference by proposing Hierarchical Inference (HI). In our work [Al-Atat, 2023], we showed the feasibility of HI for image classification applications. We used a simple threshold (for the confidence of the tinyML) for HI decisions in order to categorize if the classification output by the tinyML is correct or incorrect. Edge Impulse, a leading tinyML company, featured a post about [Ata23] citing HI as a promising distributed inference technique. In [Beh23], we further explored using linear regression and model calibration [Guo17] techniques to improve the HI decision making beyond the simple threshold rule. Furthermore, noting that the ground truth about the accepted inference is not known to the end device, in [Moo23], we studied an online learning problem which falls in the domain of Prediction with Expert Advice (PEA) with the added challenge of continuous experts. In an ongoing work [Dutta, 2024], we transformed the problem into PEA with growing number of experts. For a natural extension of the Hedge algorithm we proved O(√(T ln〖N_T 〗 )) regret bound that improves over the lower bound O(√(TN_T )). Further, as the devices are computing limited, we developed an efficient online learning algorithm that has smaller runtime and has O(√(T ln〖N_T 〗 )) regret [Al-Atat,2024].
[Al-Atat, 2023] Ghina Al-Atat, Andrea Fresa, Adarsh P. Behera, Vishnu N. Moothedath, James Gross, Jaya Prakash Champati, “The Case for Hierarchical Deep Learning Inference at the Network Edge”, in Proc. NetAI workshop, ACM Mobisys, 2023.
[Behera, 2023] Adarsh P. Behera, Roberto Morabito, Joerg Widmer, Jaya Prakash Champati, “Improved Decision Module Selection for Hierarchical Inference in Resource-Constrained Edge Devices”, in Proc. ACM MobiCom (short paper), 2023.
[Moothedath,2024] Vishnu N. Moothedath, Jaya Prakash Champati, James Gross, "Getting the Best Out of Both Worlds: Algorithms for Hierarchical Inference at the Edge," in IEEE Transactions on Machine Learning in Communications and Networking, vol. 2, pp. 280-297, 2024.
[Dutta, 2024] Puranjay Dutta, Jaya Prakash Champati, Sharayu Moharir, "Improved Regret Bounds for Growing Number of Experts," planned submission to Association for the Advancement of Artificial Intelligence (AAAI), 2025.
[Al-Atat,2024] Ghina Al-Atat, Puranjay Datta, Sharayu Moharir, Jaya Prakash Champati, “Regret Bounds for Online Learning for Hierarchical Inference,” under submission, ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), 2024.
[Beytur, 2024] Hasan Burhan Beytur, Ahmet Gunhan Aydin, Gustavo de Veciana, Haris Vikalo. “Optimization of Offloading Policies for Accuracy-Delay Tradeoffs in Hierarchical Inference”, in Proc. IEEE INFOCOM, 2024.
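As a concrete, simplified picture of the online learning formulation discussed above, the sketch below runs Hedge-style multiplicative weights over a finite set of candidate HI confidence thresholds (the experts). It is only illustrative; it is not the algorithm analysed in [Moothedath,2024], [Dutta, 2024], or [Al-Atat,2024], and the loss model (offloading cost versus the cost of a wrongly accepted local inference) is an assumption for the example.

    import math
    import random

    def hedge_hi(samples, thresholds, eta=0.5, offload_cost=0.3):
        """samples: iterable of (tinyml_confidence, tinyml_correct) pairs."""
        weights = [1.0] * len(thresholds)      # one expert per candidate threshold
        for conf, correct in samples:
            # Loss of each expert on this sample: offloading costs offload_cost,
            # a wrongly accepted local inference costs 1, a correct one costs 0.
            losses = [offload_cost if conf < th else (0.0 if correct else 1.0)
                      for th in thresholds]
            # Multiplicative-weights (Hedge) update.
            weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        best = max(range(len(thresholds)), key=lambda i: weights[i])
        return thresholds[best]

    # Example with synthetic (confidence, correctness) pairs and 5 threshold experts.
    stream = [(random.random(), random.random() < 0.8) for _ in range(1000)]
    print(hedge_hi(stream, thresholds=[0.5, 0.6, 0.7, 0.8, 0.9]))

With T rounds and N_T experts, this style of update is what yields regret bounds of the form O(√(T ln N_T)) mentioned above; handling continuous or growing expert sets and the missing ground truth for accepted inferences is the added difficulty addressed in the cited works.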
Further, the measurement study we conducted is unique and timely. The measurements not only served as a basis for evaluating different DL inference strategies on devices, but are also of independent interest to the embedded ML research community. We made them available for free access on GitHub. We also plan to submit the measurements as a tinyML benchmark to MLCommons, a non-profit organization that aims to accelerate machine learning innovation. This will further increase DIME’s positive impact on society through its use by industry as well.