Periodic Reporting for period 1 - CloudButton (Serverless Data Analytics Platform)
Reporting period: 2019-01-01 to 2020-06-30
Serverless computing offers new opportunities for extreme-scale analytics: it makes it possible to run embarrassingly parallel computations with extraordinary simplicity and unlimited scalability, thanks to the automated management of cloud resources. Building on this, our novel programming abstractions and tools will considerably simplify the deployment of data analytics code to the cloud. The main objective of the project is therefore to create CloudButton: a Serverless Data Analytics Platform. To achieve this ambitious objective, the project defines the following goals:
- Create a High Performance Serverless Compute Engine for Big Data. This is the foundational technology for the CloudButton platform, and it must overcome the limitations of existing serverless platforms. In particular, it includes extensions for i) stateful and highly performant execution of serverless tasks, ii) optimized elasticity and operations management of functions thanks to new locality-aware scheduling algorithms, iii) efficient QoS management of the containers that host serverless functions, and iv) a Serverless Execution Framework supporting typical dataflow models.
- Support for Mutable Shared Data in Serverless Computing. To simplify the transition from sequential to (massively) parallel code, we will design a new middleware that makes it possible to quickly spawn and share mutable data structures in a serverless computing platform. Our Mutable Shared Data middleware will i) offer an easy-to-use programming framework to add state to serverless computing, ii) provide dynamic data replication and tunable consistency to match the performance requirements of serverless data analytics, and iii) integrate this framework with an in-memory data grid for performance.
- Design novel Serverless Cloud Programming Abstractions. The aim is to provide a new programming model for serverless cloud infrastructures that can express a wide range of existing data-intensive applications with minimal changes. At the same time, the programming model should i) preserve the benefits of a serverless execution model in terms of resource efficiency, performance, scalability and fault tolerance, and ii) explicitly support stateful functions in applications, while offering guarantees with respect to the consistency and durability of the state.
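As a rough illustration of what mutable shared data means for parallel tasks, the sketch below lets several workers update one shared object through a synchronized method. It is a local, in-memory stand-in that uses only the Python standard library; it is not the project's middleware, and `SharedCounter`, `count_even` and `run` are hypothetical names introduced for this example. In a serverless deployment the shared object would presumably live in a remote store such as an in-memory data grid rather than in process memory.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedCounter:
    """Hypothetical shared mutable object (held in local memory here)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, n):
        # The lock makes the read-modify-write step atomic across workers.
        with self._lock:
            self._value += n
            return self._value

    def get(self):
        with self._lock:
            return self._value


def count_even(chunk, counter):
    """One parallel task: count even numbers and publish the partial result."""
    counter.add(sum(1 for x in chunk if x % 2 == 0))


def run(chunks, workers=4):
    """Fan the chunks out to workers that all update the same counter."""
    counter = SharedCounter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda chunk: count_even(chunk, counter), chunks))
    return counter.get()


if __name__ == "__main__":
    print(run([[1, 2, 3, 4], [6, 7], [8]]))
```

The tasks never return their partial results to the caller; they communicate only through the shared object, which is the style of programming that the mutable shared data goal is meant to make available to serverless functions.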
The main results produced during this reporting period can be summarized as follows:
- Development of PyWren for IBM Cloud: https://github.com/pywren/pywren-ibm-cloud
- Development of new Serverless APIs that support the interception of standard Python libraries: https://github.com/cloudbutton/cloudbutton
- Design and implementation of a serverless orchestration layer over Knative: https://github.com/triggerflow/triggerflow
- Design and implementation of Crucial: a Java Serverless Toolkit that provides a serverless executor, shared mutable state, and synchronization primitives: https://github.com/crucial-project
- Extension of the Infinispan data store to improve its scalability and provide interconnection with the CloudButton toolkit.
- Porting of existing C++ HPC applications using MPI and OpenMP with Faasm.
- Design and implementation of a Python multiprocessing library over IBM PyWren.
- Porting of scikit-learn to run over the multiprocessing library.
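The multiprocessing-related results above rest on the idea that code written against the standard Python `multiprocessing.Pool` interface can be redirected to serverless functions with minimal changes. The sketch below shows that programming pattern using only the standard library, with a thread-backed pool that exposes the same `map` interface and runs locally. It does not use the CloudButton or PyWren API; `word_count`, `total_words` and the sample documents are hypothetical.

```python
from multiprocessing.pool import ThreadPool  # same map() interface as multiprocessing.Pool


def word_count(text):
    """Count the words in one document (an independent, stateless task)."""
    return len(text.split())


def total_words(documents, workers=4):
    """Apply word_count to every document in parallel and sum the results."""
    with ThreadPool(processes=workers) as pool:
        # map() fans the tasks out; no task depends on any other.
        counts = pool.map(word_count, documents)
    return sum(counts)


if __name__ == "__main__":
    documents = ["serverless data analytics", "embarrassingly parallel", "cloud"]
    print(total_words(documents))
```

Because each call to `word_count` is independent, the same `map` call could in principle be served by any backend that implements the pool interface, which is the property that an interception-based approach relies on.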
The results achieved during the first part of the project have been published in top academic conferences such as ACM/IFIP Middleware'19, USENIX ATC'20, and ACM DEBS'20. Moreover, project outcomes have been presented at important events such as Strata Data NY, and in invited talks at Intel Labs and IBM Watson Research.
The project has established KPIs to evaluate serverless data analytics technologies around simplicity/productivity, cost/performance, and scalability. Use cases are being evaluated according to these KPIs.
First, the consortium members contribute to five open source projects that may have a strong impact in cloud communities: IBM-PyWren, Infinispan, Crucial, Faasm, and the CloudButton benchmark.
Second, the novel serverless approaches developed in the context of the CloudButton project have already proven successful in one of the use cases. The CloudButton toolkit was used to process massive spatial metabolomics datasets. For example, initial benchmarks showed that the serverless approach processed certain datasets in less than an hour, whereas the same processing took at least four hours on an Apache Spark cluster.
Third, CloudButton components have been used in data analytics contexts outside the three settings selected for the project: metabolomics, genomics and geospatial data. In particular, the IBM-PyWren repository shows usage examples in the fields of finance (stock price prediction) and molecular dynamics simulations.
In the following stages of the project, we aim to consolidate and integrate the platform and to promote CloudButton in open source communities.