Periodic Reporting for period 1 - MLSysOps (Machine Learning for Autonomic System Operation in the Heterogeneous Edge-Cloud Continuum)
Reporting period: 2023-01-01 to 2023-12-31
However, the rise of cloud-edge-IoT systems, and the plethora of computing and sensing devices involved in modern applications, further aggravate the already challenging task of monitoring and managing heterogeneous, distributed resources, this time at an extreme scale, making human-in-the-loop management completely unrealistic. The solution is to make computing systems “autonomic”, so that they can manage themselves based on high-level objectives set by application owners and system administrators.
The goal of the MLSysOps project is to support autonomic system management across the cloud-edge-IoT continuum, using a combination of AI and ML methods. MLSysOps will develop a hierarchical agent framework on top of different resource management and application deployment/orchestration mechanisms. To achieve adaptivity, the agents will incorporate continual ML model learning in tandem with intelligent retraining during application execution. The project emphasizes openness and extensibility by dissociating management from control and by employing explainable ML methods and an API for pluggable ML models. Key aspects the project intends to address include energy efficiency and the utilization of green energy, performance, low latency, efficient tier-less storage, cross-layer orchestration that includes resource-constrained devices, resilience to imperfections of physical networks, trust, and security.
More specifically, the MLSysOps project has the following key objectives: (1) Deliver an open AI-ready, agent-based framework for holistic, trustworthy, scalable, and adaptive system operation across the heterogeneous cloud-edge continuum. (2) Develop an AI architecture supporting explainable, efficiently retrainable ML models for end-to-end autonomic system operation in the cloud-edge continuum. (3) Enable efficient, flexible, and isolated execution across the heterogeneous continuum. (4) Support green, resource-efficient, and trustworthy system operation, while satisfying application QoS/QoE requirements. (5) Perform realistic model training, validation, and evaluation.
The technology that will be developed in the MLSysOps project will be tested and evaluated using several research testbeds and two real-world application testbeds in the domain of smart cities and smart agriculture. Additionally, system simulators will enable scale-out experiments that cannot be performed using the available testbeds.
In total, 38 requirements were defined, structured in 10 requirements groups (RGs) according to the different application and system management aspects tackled by the MLSysOps project: structured system and application descriptions; application deployment and orchestration; node-level resource usage and management; storage; trust; wireless network management and security at the edge; 5G network management; optical networking in the datacenter; energy-efficient and green computing in datacenters; and machine learning. Each requirement group comes with its own KPIs, which will be verified using different simulation environments and testbeds according to the evaluation plan.
The MLSysOps framework architecture was designed to capture the above aspects. It adopts a hierarchical, agent-based approach whereby different types of interacting agents are responsible for monitoring and controlling management and configuration aspects that are relevant in each of the layers of the edge-cloud-IoT continuum. Also, the architecture defines the northbound interfaces for the interaction between the framework and the system administrator / application owner, and the southbound interfaces for the interaction with the different mechanisms that can be used to execute the management and configuration operations decided by the agents. Lastly, the architecture specifies the interface used for the flexible registration, selection, deployment, invocation, and retraining of the different machine learning (ML) models.
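As an illustration of this hierarchical, agent-based pattern, the sketch below shows agents at two continuum layers that monitor local metrics (a stand-in for the southbound telemetry interface) and delegate decisions to a pluggable model behind a common interface. All class names, the `decide` method, and the threshold logic are hypothetical illustrations, not MLSysOps APIs.

```python
from abc import ABC, abstractmethod

class MLModel(ABC):
    """Hypothetical pluggable-model interface: agents invoke models through it."""
    @abstractmethod
    def decide(self, metrics: dict) -> str: ...

class ThresholdModel(MLModel):
    """Toy stand-in for a learned model: scale out when CPU load is high."""
    def decide(self, metrics: dict) -> str:
        return "scale_out" if metrics.get("cpu", 0.0) > 0.8 else "no_op"

class Agent:
    """One agent per continuum layer; parents also see child telemetry."""
    def __init__(self, layer: str, model: MLModel, children=None):
        self.layer = layer
        self.model = model
        self.children = children or []

    def monitor(self) -> dict:
        # Southbound side (simulated): a real agent would query the telemetry
        # backbone; here the edge node simply reports a high synthetic load.
        local = {"cpu": 0.9} if self.layer == "edge-node" else {"cpu": 0.3}
        for child in self.children:
            local[f"{child.layer}.cpu"] = child.monitor()["cpu"]
        return local

    def manage(self) -> str:
        # The management decision is delegated to the pluggable model.
        return self.model.decide(self.monitor())

node = Agent("edge-node", ThresholdModel())
cluster = Agent("cluster", ThresholdModel(), children=[node])
print(node.manage())     # the loaded edge node decides to scale out
print(cluster.manage())  # the lightly loaded cluster layer does nothing
```

Swapping `ThresholdModel` for a different `MLModel` implementation changes the policy without touching the agent, which is the point of separating management (the agent loop) from control (the model and the actuation mechanisms).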
First versions of key supporting mechanisms were developed, including: the telemetry backbone for forwarding the different system- and application-level performance and resource usage metrics; node-level configuration knobs for different node platforms; portable and flexible usage of accelerators for efficient computing; storage adaptation strategies and processes for changing the storage policy of a storage bucket; configuration of 5G operation in terms of UPF placement and usage; flexible deployment and orchestration of application components across the cloud-edge-IoT continuum; support for code deployment on resource-constrained far-edge devices; and agents running on different physical nodes interacting over the network.
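To make the telemetry backbone's role concrete, the following is a minimal in-process sketch of metric forwarding: node-side exporters publish timestamped samples, and interested agents subscribe to the metrics they manage. The `TelemetryBackbone` class and its `publish`/`subscribe` methods are assumptions for illustration; the actual backbone operates over the network between nodes.

```python
import time
from collections import defaultdict

class TelemetryBackbone:
    """Hypothetical in-process stand-in for the metric-forwarding backbone."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, metric: str, callback):
        # An agent registers interest in one metric stream.
        self.subscribers[metric].append(callback)

    def publish(self, source: str, metric: str, value: float):
        # A node exporter emits one timestamped sample to all subscribers.
        sample = {"source": source, "metric": metric,
                  "value": value, "ts": time.time()}
        for cb in self.subscribers[metric]:
            cb(sample)

backbone = TelemetryBackbone()
received = []
backbone.subscribe("cpu_util", received.append)    # an agent subscribes
backbone.publish("edge-node-1", "cpu_util", 0.72)  # a node exporter publishes
print(received[0]["source"], received[0]["value"])
```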
Initial work on ML models was performed, though the bulk of the work is planned for the next project phase. Reinforcement Learning (RL) will be used to observe the state of a system and determine which actions to perform in order to optimize specific metrics and maximize the agent's reward. Explanations will be provided for each RL result, documenting which parts of the input were most important for the action taken by an ML model at a particular time, but also across many actions over a time window.
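The RL loop described above can be sketched with tabular Q-learning on a toy two-state system. Everything here is a simplified assumption for illustration: the states, actions, reward dynamics, and hyperparameters are invented, and the project's actual models will be far richer.

```python
import random
random.seed(0)

# Toy system: the state is "idle" or "overloaded"; actions adjust capacity.
STATES = ["idle", "overloaded"]
ACTIONS = ["scale_out", "scale_in"]

def step(state: str, action: str):
    """Hypothetical environment dynamics: reward actions that fit the state."""
    if state == "overloaded" and action == "scale_out":
        return "idle", 1.0        # relieving an overload is rewarded
    if state == "idle" and action == "scale_in":
        return "idle", 1.0        # reclaiming idle capacity is rewarded
    return "overloaded", -1.0     # any other action degrades the system

q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

state = "idle"
for _ in range(500):
    # Epsilon-greedy: mostly exploit the current Q-table, sometimes explore.
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: q[(state, a)])
    nxt, reward = step(state, action)
    best_next = max(q[(nxt, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    state = nxt

print(max(ACTIONS, key=lambda a: q[("overloaded", a)]))
```

The learned Q-values also hint at the explainability angle: inspecting which state features drove the gap between `q[("overloaded", "scale_out")]` and the alternatives is a (much simplified) version of documenting why an action was taken.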
Finally, there was significant progress in terms of the environments that will be used for testing and evaluating all the above. Five different simulation environments and five research testbeds have been prepared, covering the aspects of IoT/edge computing and networking, 5G, optical datacenter networking, multi-datacenter computing and drone-based computing. Also, two application testbeds have been prepared to experiment with smart city and smart agriculture scenarios in the real world.
In terms of potentially exploitable technological results, project partners developed mechanisms for the real-time configuration change of S3 object storage buckets, the flexible selection of accelerated function implementations, and a container runtime for unikernels.
The bulk of technological results will be produced during the second and third years of the project, when different parts of the MLSysOps framework will also be released to the community.