Periodic Reporting for period 2 - UniServer (A Universal Micro-Server Ecosystem by Exceeding the Energy and Performance Scaling Boundaries)
Reporting period: 2017-08-01 to 2019-07-31
To substantially improve energy-efficiency, there is a need to design new error-resilient server ecosystems that can deal with increased hardware variability more intelligently than the conventional pessimistic paradigms. The UniServer project turns the tables and puts forth the following question: why allow the worst-case operating margins of fabricated chips to artificially constrain the performance and energy of today's systems? The reality is that each manufactured processor and each memory module is inherently different and falls into a distinct performance bin, meaning that each chip has different capabilities in terms of energy-efficiency and performance. According to the overall UniServer design target, the computing industry needs to see such heterogeneity not as a problem but as an opportunity to improve energy-efficiency, especially in next-generation servers. ‘Functional heterogeneity’ has already been adopted in embedded systems and in servers with hybrid CPU/GPU/accelerator architectures. It is therefore time to also expose the ‘intrinsic heterogeneity’, harness it and use it to our advantage by redesigning the hardware and software to improve energy-efficiency or performance, which is essential for realizing the microservers needed to support the imminent IoT revolution. Based on this observation, the UniServer approach plans to replace the existing conservative margins with the real capabilities of each individual core and memory array. This will enable us to exceed the energy and performance scaling boundaries adopted in servers.
In particular, the cores and memories have already been characterised under various conditions, with results for cores, caches and DRAM showing significant design margins that can be exploited within the UniServer concept. By month 18, the Hardware Exposure Interface (HEI) and the error handling procedures had been defined and implemented on the first prototype. A first beta version of the HealthLog and StressLog monitors has also been implemented, while the interface between the Predictor and the other software components of the UniServer platform has been defined and its porting to the initial prototype has begun.
We have also quantified how intensively hypercalls and system calls are used at the hypervisor level, as well as the sensitivity of different data structures and code modules of the hypervisor at both the user and kernel level. In addition, we have set up a fault injection infrastructure, which is already used to identify the impact of potential faults on various structures of the system software. Our analysis shows that these are necessary steps to enable intelligent, selective protection. To this end, we have started implementing mechanisms to increase the resilience of the hypervisor against CPU faults (migrating the functionality of sensitive system code to reliable cores) and memory faults (supporting heterogeneous-reliability memory through different memory zones). Resilience mechanisms and enhanced monitoring capabilities have also been defined and enabled at the OpenStack layer. During this period, all applications have been collected and ported to the UniServer board, and the collection of initial results against the metrics of success, which were also defined in this period, has begun. The project ideas and results have been published in numerous publications in top-tier venues and were disseminated through two organized workshops, numerous talks, the project website and social media channels.
Besides addressing the power and variability challenge, the envisioned ecosystem also contributes to sustainability and programmability and addresses privacy/security concerns by running services at the Edge, complementing the Cloud. Services running at the Edge relieve the public network of the Big Data burden and at the same time ensure the required quality of service for latency-sensitive IoT services. The complete software ecosystem also allows cloud and edge data-centers to be administered seamlessly, lessening the programming effort that would otherwise be required to port a service to specialized hardware in the cloud. Finally, the ability of edge resources to provide a complete service within a home or on the premises of a small enterprise naturally lends itself to improved privacy, since the data do not need to be communicated through the public network or reside in third-party data-centers.
Overall, realizing the envisioned error-resilient ecosystem for energy-efficiency involves many challenges, as detailed above, since radically new technologies need to be developed with the assistance of hardware and software developers. The UniServer consortium brings together a team of academic institutions and world-leading industrial partners that are actively working towards realizing and evaluating the potential benefits of this vision, which already shows potential to break the conventional pessimistic limits of performance and energy-efficiency.