Periodic Reporting for period 4 - Hi-EST (Holistic Integration of Emerging Supercomputing Technologies)
Reporting period: 2019-11-01 to 2020-04-30
Supercomputers are critical infrastructures for addressing Grand Challenges in fields such as Bioinformatics, Physics, and Earth Sciences. Supercomputing is a crucial asset for the EU's innovation capacity: it is a worldwide market worth EUR 14 billion, of which Europe represents 35%, and two thirds of this market depends on public funding. The goal of Hi-EST is to make a significant advance in the methods, mechanisms and algorithms for the integrated management of heterogeneous supercomputing workloads. This results in a more efficient management of the infrastructure, in an automated way that continuously adjusts the number and type of resources allocated to each workload. Hi-EST addresses this problem for next-generation infrastructures and workloads, entering a still unexplored space: the intersection of classic HPC workloads with data-driven, real-time and interactive applications on the one hand, and a set of emerging technologies combining persistent memories, key/value stores, RDMA-based devices, and GPUs on the other. In this scope, Hi-EST takes advantage of the ultra-large size of exa-scale systems to develop advanced real-time performance learning techniques and data and task placement algorithms that address this NP-hard problem in this novel scenario. The placement decisions are continuously enforced through the use of Software Defined Environments.
In summary, Hi-EST addressed the following objectives:
1. Advance research frontiers in Adaptive Learning Algorithms by proposing Deep Learning techniques for guiding task and data placement decisions, and the first known Adaptive Learning Architecture for exa-scale Supercomputers.
2. Advance research frontiers in Task Placement and Scheduling by proposing novel topology-aware workload placement strategies, and by extending unifying performance models for heterogeneous workloads to cover an unprecedented number of workload types.
3. Advance research frontiers in Data Placement strategies by studying data stores on top of heterogeneous sets of key/value stores connected to Active Storage technologies, as well as by proposing the first known uniform API to access hierarchical data stores based on key/value stores.
4. Advance research frontiers in Software Defined Environments by developing policies for the upcoming disaggregated data centres, and by creating placement algorithms that combine data and task placement into one single decision-making process.
The team explored the automatic generation of performance models using unsupervised learning pipelines, and introduced a novel method for modelling and discovering phases in time series in an unsupervised way, using Conditional Restricted Boltzmann Machines (CRBM). It also proposed a novel use of Recurrent Neural Networks (RNN) to perform sequence-to-sequence translations of the resource consumption associated with workloads, in order to estimate the interference between co-located workloads in the data centre. The team also developed the Monoid Tree Aggregator, a general sliding-window aggregation framework with amortized O(1) time complexity and an O(log n) worst case per insertion. Finally, the team contributed a novel topology-aware workload placement strategy to schedule accelerated jobs on multi-GPU systems.
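The sliding-window aggregation idea can be illustrated with a short sketch. The following minimal Python example shows monoid-based sliding-window aggregation using the classic two-stack technique, which also achieves amortized O(1) operations; it is not the project's Monoid Tree Aggregator itself, and the names and the example stream are purely illustrative.

# Minimal sketch of monoid-based sliding-window aggregation (not the
# project's Monoid Tree Aggregator): the two-stack technique gives
# amortized O(1) insert/evict/query for any associative operator with
# an identity element. All names here are illustrative.

class SlidingWindowAggregator:
    def __init__(self, combine, identity):
        self.combine = combine    # associative binary operator of the monoid
        self.identity = identity  # identity element of the monoid
        self.front = []           # older elements, as (value, running aggregate)
        self.back = []            # newer elements, as (value, running aggregate)

    def _top_agg(self, stack):
        return stack[-1][1] if stack else self.identity

    def insert(self, value):
        # Push onto the back stack, keeping a running aggregate in arrival order.
        self.back.append((value, self.combine(self._top_agg(self.back), value)))

    def evict(self):
        # Remove the oldest element; refill the front stack lazily when empty.
        if not self.front:
            while self.back:
                value, _ = self.back.pop()
                self.front.append((value, self.combine(value, self._top_agg(self.front))))
        self.front.pop()

    def query(self):
        # Aggregate of the whole window in O(1).
        return self.combine(self._top_agg(self.front), self._top_agg(self.back))


if __name__ == "__main__":
    # Example: sliding maximum over a stream of per-node utilisation samples.
    win = SlidingWindowAggregator(max, float("-inf"))
    for sample in [0.2, 0.9, 0.4]:
        win.insert(sample)
    print(win.query())   # 0.9
    win.evict()          # drop the oldest sample (0.2)
    print(win.query())   # 0.9

Because the aggregator only requires associativity, the same structure works for sums, maxima, counts or histogram merges over monitoring streams, which is what makes the monoid formulation general.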
During the project, SMUFIN, a somatic mutation finding software developed at the Barcelona Supercomputing Center, was optimized to take advantage of the project's advances. The work performed illustrated how the data-intensive nature of processing human genome data remains both a computational and a memory challenge. However, we described techniques and mechanisms to overcome the memory challenge and to alleviate the computational one. In particular, we demonstrated how accelerators can be used to shuffle data to minimize inter-thread communication and how they can cooperate with the CPU to build large Bloom filters. Results showed that the Barcelona Supercomputing Center (BSC) is now able to process the genomes of approximately 250 patients per MWh of energy consumed, whereas with the previous generation of the pipeline the output was approximately 18 patients per MWh.
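To illustrate the kind of data shuffling involved, the following minimal Python sketch shows one common way to reduce inter-thread communication when building a large Bloom filter: keys (here, k-mers) are routed by hash to workers, each of which owns a disjoint slice of the filter, so inserts never touch another worker's memory. This is only a schematic illustration under assumed parameters, not SMUFIN's actual GPU/CPU pipeline.

# Schematic sketch (not SMUFIN's implementation) of a hash-partitioned
# Bloom filter: each partition is touched only by its owning worker, so
# the expensive step is the shuffle, not shared-memory synchronisation.
# All parameters and names are illustrative assumptions.
import hashlib

NUM_PARTITIONS = 4            # e.g. one per GPU block / CPU thread
BITS_PER_PARTITION = 1 << 20  # bits per partition
NUM_HASHES = 3                # hash functions per key

def _hash64(key: bytes, salt: int) -> int:
    digest = hashlib.blake2b(key, person=salt.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little")

def partition_of(key: bytes) -> int:
    # The "shuffle" step: route each k-mer to the worker that owns it.
    return _hash64(key, 0) % NUM_PARTITIONS

class BloomPartition:
    """One slice of the filter, written only by its owning worker."""
    def __init__(self) -> None:
        self.bits = bytearray(BITS_PER_PARTITION // 8)

    def add(self, key: bytes) -> None:
        for salt in range(1, NUM_HASHES + 1):
            pos = _hash64(key, salt) % BITS_PER_PARTITION
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        for salt in range(1, NUM_HASHES + 1):
            pos = _hash64(key, salt) % BITS_PER_PARTITION
            if not (self.bits[pos // 8] >> (pos % 8)) & 1:
                return False
        return True

if __name__ == "__main__":
    partitions = [BloomPartition() for _ in range(NUM_PARTITIONS)]
    for kmer in (b"ACGTACGT", b"TTGACCAA", b"GGGTTTCC"):
        partitions[partition_of(kmer)].add(kmer)      # shuffle, then insert locally
    probe = b"ACGTACGT"
    print(probe in partitions[partition_of(probe)])   # True (with Bloom false-positive rate)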
Over the course of the project, 22 research papers were accepted for publication (10 journal and 12 conference papers), contributing results across the four research pillars. Additionally, three patents were filed, and a spinoff (Nearby Computing) was created from the results of the project.
A common belief in the supercomputing field is that the future bottleneck for many data-intensive scientific advances will be the technological approach currently used to process data. This situation is clearly visible in the sustained growth rate of the GenBank genome database at NCBI, or in the experience of the ATLAS project in discovering the Higgs boson at the Large Hadron Collider (LHC). In both cases, the scientific grand challenge resembles a data mining problem more than a classical CPU-intensive supercomputing project, yet both still need the ultra-high capacity of supercomputers to address the problems each project targets. The extremely large scale of next-generation exa-scale supercomputers means that every monitoring effort also becomes a challenge in terms of data management. But the same challenge is also an opportunity: the amount of information that can be collected in real time from an exa-scale infrastructure opens the door for advanced learning techniques to generate knowledge about the applications being run. Another source of change in the supercomputing domain is the enormous interest of the Public Sector and Social Scientists in so-called Smart Cities and the Internet of Things (or, more recently, the Internet of Everything). These workloads require the processing of unbounded data streams, which poses enormous challenges to the processing centres and in many cases demands the computing capacity of supercomputers.
For this reason, there is an urgent need for a significant advance in the methods, mechanisms and algorithms for the integrated management of heterogeneous supercomputing workloads. This is a huge challenge, since managing even a homogeneous set of distributed workloads on homogeneous infrastructures is already an NP-hard problem. The level of dynamism expected in future-generation supercomputing significantly raises the complexity of the problem. Addressing this grand challenge is the ultimate goal of the Hi-EST project.
In particular, Hi-EST plans to advance research frontiers in four different areas:
1. Adaptive Learning Algorithms: by proposing a novel use of Deep Learning techniques for guiding task and data placement decisions;
2. Task Placement: by proposing novel algorithms to map heterogeneous sets of tasks on top of systems enabled with Active Storage capabilities, and by extending unifying performance models for heterogeneous workloads to cover an unprecedented number of workload types;
3. Data Placement: by proposing novel algorithms to map data on top of heterogeneous sets of key/value stores connected to Active Storage technologies; and
4. Software Defined Environments (SDE): by extending SDE description languages with a currently nonexistent vocabulary for describing Supercomputing workloads, which will be leveraged to combine data and task placement into one single decision-making process.