Self-Organising, Self-Managing Heterogeneous Cloud

Periodic Reporting for period 2 - CloudLightning (Self-Organising, Self-Managing Heterogeneous Cloud)

Reporting period: 2016-08-01 to 2018-01-31

Clouds have revolutionised the computing landscape by offering a variety of service models on top of shared commodity hardware. Recent years have seen the cloud landscape transforming to accommodate more and more sophisticated services including HPC-like services, Big Data Analytics, Machine Learning and Artificial Intelligence. The success in providing and extending these services depends on how well the cloud can manage heterogeneity - not only in the services being offered but also in the software tools and hardware needed to support them. CloudLightning provides a novel, single, extensible architecture for the next generation of cloud computing. It is a framework for managing heterogeneity at any scale capable of provisioning heterogeneous cloud resources to deliver services, specified by the user, using a bespoke service description language. This framework coherently addresses a number of topical issues in cloud computing, including: incorporating and dynamically constructing HPC environments, efficiently managing heterogeneous resources at scale (e.g. utilisation and power consumption), incorporating new hardware and new types of hardware readily, addressing over-provisioning by using profiled services, making complex service workflows available through a Blueprint-as-a-Service (BPaaS) delivery model and automating service discovery, resource selection and service deployment without having to use resource reservation.

The CloudLightning architecture is designed to be highly scalable and extensible (via a novel Plug and Play mechanism), embracing different types of heterogeneity. A unique feature of this approach is that it facilitates both the incorporation and the dynamic construction of HPC environments. In the former case, HPC machines can be added to the CloudLightning resource fabric by registering the resource manager of the HPC machine as a CloudLightning resource. In the latter case, HPC-like environments can be dynamically constructed to support a particular service from resources co-located on the same low-latency network. Together, these provide a mechanism for offering HPC-as-a-Service.
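The dynamic-construction case can be illustrated with a minimal sketch. It assumes that each resource carries an identifier for the low-latency network it sits on and a free/busy flag; the `Node` type, its fields and the function `build_hpc_environment` are hypothetical names chosen for illustration and are not part of the CloudLightning code base.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Hypothetical view of one resource in the fabric."""
    node_id: str
    network: str  # identifier of the low-latency network the node sits on
    free: bool    # whether the node is currently available


def build_hpc_environment(nodes, size) -> Optional[list]:
    """Return `size` free nodes that share one network, or None if impossible."""
    by_network = defaultdict(list)
    for node in nodes:
        if node.free:
            by_network[node.network].append(node)
    # Visit networks in name order so the result is deterministic.
    for network in sorted(by_network):
        candidates = by_network[network]
        if len(candidates) >= size:
            return candidates[:size]
    return None


nodes = [
    Node("a", "ib0", True),
    Node("b", "ib1", True),
    Node("c", "ib1", False),
    Node("d", "ib1", True),
    Node("e", "ib0", False),
]
environment = build_hpc_environment(nodes, 2)
```

In this toy data only the network `ib1` has two free nodes, so the environment is built from nodes `b` and `d`. A request for three nodes would fail, because no single network has that many free.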

An important objective of CloudLightning was to remove the burdens of low-level service provisioning, optimisation and orchestration from the cloud consumer. A related objective was to locate decisions about resource usage with the individual resource components, where optimal decisions could be made. To achieve these objectives, a system was created that is composed of a hierarchy of resource managers and employs self-organisation and self-management strategies. By addressing the inefficient use of resources, CloudLightning can deliver savings to both the cloud provider and the cloud consumer through reduced power consumption and improved service delivery, with hyperscale systems particularly in mind.
The CloudLightning architecture was created as a number of interacting components that work in concert to execute a workflow of services (in a BPaaS delivery model). The workflow is constructed from a catalogue of specialised, profiled services and subsequently deployed on an appropriate set of underlying (heterogeneous) resources. This collection of resources is determined without user intervention so as to satisfy the QoS parameters associated with the Blueprint as well as the non-functional objectives of the cloud service provider (CSP), such as maximising utilisation and minimising energy consumption and management overhead.
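As a rough illustration of how a request might descend a hierarchy of resource managers without user intervention, the following toy sketch routes each request to the child subtree with the most free capacity. This greedy rule, the `Manager` class and the notion of abstract "units" are all assumptions made for illustration; the text above does not specify the actual SOSM strategies, and this is not a description of them.

```python
from dataclasses import dataclass, field


@dataclass
class Manager:
    """Hypothetical resource manager in a tree; only leaves hold capacity."""
    name: str
    free_units: int = 0
    children: list["Manager"] = field(default_factory=list)

    def capacity(self) -> int:
        # Free capacity of this manager plus everything beneath it.
        return self.free_units + sum(c.capacity() for c in self.children)

    def route(self, units: int):
        """Return the list of manager names the request passes through, or None."""
        if self.capacity() < units:
            return None
        if not self.children:
            self.free_units -= units  # the leaf provisions the request
            return [self.name]
        # Greedy assumption: forward to the child subtree with most free capacity.
        best = max(self.children, key=Manager.capacity)
        path = best.route(units)
        return None if path is None else [self.name] + path


root = Manager("root", children=[
    Manager("cell-a", children=[Manager("gpu-rack", 4)]),
    Manager("cell-b", children=[Manager("cpu-rack", 8)]),
])
first_path = root.route(3)
```

Each manager decides using only the aggregate state of its own children, which mirrors the idea of locating decisions with individual resource components rather than in a central scheduler.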

The main components developed to deliver the CloudLightning system are: a Gateway Service, comprising a User Interface for Blueprint creation and Service Lifecycle Management; the Self-Organising and Self-Managing (SOSM) system, comprising a novel routing network designed to allow service requests to navigate autonomously towards the most appropriate resource, or set of resources, to provision each request; a Plug and Play mechanism that allows the dynamic registration (and subsequent deregistration, if required) of resources and their associated telemetry endpoints; and a Universal Telemetry Interface that allows the different telemetry systems associated with the various resources in the CloudLightning fabric to be queried in a uniform manner.
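The relationship between the Plug and Play mechanism and the Universal Telemetry Interface can be sketched as a registry of resources, each paired with an adapter that exposes one common query method. All class and method names below (`TelemetryAdapter`, `StaticTelemetry`, `Resource`, `ResourceRegistry`, `query_all`) are hypothetical and chosen for illustration; they are not the project's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class TelemetryAdapter(ABC):
    """Hypothetical uniform interface over a resource-specific telemetry system."""

    @abstractmethod
    def query(self, metric: str) -> float:
        ...


class StaticTelemetry(TelemetryAdapter):
    """Toy adapter with fixed readings; a real one would call a monitoring endpoint."""

    def __init__(self, readings):
        self._readings = dict(readings)

    def query(self, metric: str) -> float:
        return self._readings[metric]


@dataclass
class Resource:
    resource_id: str
    kind: str  # e.g. "CPU", "GPU", "MIC" or "DFE"
    telemetry: TelemetryAdapter


class ResourceRegistry:
    """Hypothetical plug-and-play registry of resources and telemetry endpoints."""

    def __init__(self):
        self._resources = {}

    def register(self, resource: Resource) -> None:
        if resource.resource_id in self._resources:
            raise ValueError(f"already registered: {resource.resource_id}")
        self._resources[resource.resource_id] = resource

    def deregister(self, resource_id: str) -> None:
        # Removing an unknown id is treated as a no-op.
        self._resources.pop(resource_id, None)

    def query_all(self, metric: str) -> dict:
        # Every resource answers through the same adapter method.
        return {rid: r.telemetry.query(metric) for rid, r in self._resources.items()}


registry = ResourceRegistry()
registry.register(Resource("gpu-0", "GPU", StaticTelemetry({"utilisation": 0.4})))
registry.register(Resource("dfe-0", "DFE", StaticTelemetry({"utilisation": 0.1})))
before = registry.query_all("utilisation")
registry.deregister("gpu-0")
after = registry.query_all("utilisation")
```

Supporting a new type of hardware in this sketch only requires a new adapter subclass; the code that queries the registry stays unchanged.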

CloudLightning was realised and exercised end to end on a testbed of heterogeneous resources comprising CPUs, Graphics Processing Units (GPUs), Many Integrated Cores (MICs) and Data Flow Engines (DFEs). The system was evaluated using three primary use cases: Oil and Gas, Genomics, and Ray Tracing. These HPC-like use cases were containerised and converted to cloud applications, and traces from their execution were gathered and used as input to a large-scale simulation of the SOSM system. The simulation examined scalability, power consumption, computational efficiency and resource utilisation, and used this information to compare the efficiency of the SOSM system with traditional cloud resource allocation schemes. An analysis of the results shows that the SOSM system compares very favourably with traditional methods and is thus a viable approach for future cloud resource management, particularly given the added complexity of managing the emerging heterogeneous cloud.

The results of the CloudLightning project have been disseminated through: an open access book, 'Heterogeneity, High Performance Computing, Self-Organization and the Cloud'; 47 peer-reviewed scientific publications; presentations at 33 conferences, in addition to one conference organised by the CloudLightning consortium; participation in 15 workshops, in addition to 4 workshops organised by the consortium; the development and delivery of a MOOC, 'High Performance Computing in the Cloud', in collaboration with FutureLearn; 4 industry briefings to engage with industry stakeholders; and 90 articles in non-scientific and non-peer-reviewed publications.
The CloudLightning project added significantly to the technical state of the art as evidenced by 47 peer-reviewed scientific publications and contributions to the standards body IEEE 2302 - Standard for Intercloud Interoperability and Federation (SIIF).

The novel contributions of the project include the design and implementation of a self-organising, self-managing, heterogeneous, service-oriented cloud architecture and its constituent components, based on a blueprint-as-a-service deployment model and supporting separation of concerns. The CloudLightning use cases were made cloud-friendly through containerisation and made available through an HPC-as-a-Service deployment model. Finally, a bespoke simulation framework was constructed for simulating dynamic resource allocation schemes in complex, hyper-scale, heterogeneous cloud infrastructures.

The impacts of the CloudLightning project include: a lower operational overhead for deploying services; reduced complexity in deploying HPC workloads, both on traditional cloud resources and on heterogeneous resources; and the potential to increase energy efficiency, resulting in an attractive cost structure for service providers. Service providers can also make use of the directed evolution present in the CloudLightning system to evolve a cloud configuration that is appropriate to their business needs.
CloudLightning Architecture Overview Diagram