
Software Defined Storage for Big Data

Periodic Reporting for period 2 - IOSTACK (Software Defined Storage for Big Data)

Reporting period: 2016-07-01 to 2017-12-31

Storage management of Big Data analytics in large, multi-tenant clusters is a complex and time-consuming task. Today, many companies and organizations suffer from a lack of automation in their daily Big Data analytics management operations, which hinders their competitiveness and efficiency. This is especially true for heterogeneous workloads and variable, unanticipated tenant requirements that must be satisfied within stringent time limits. In response to these challenges, Software-Defined Storage (SDS) has recently become a prime candidate to simplify storage management in the cloud.

The main objective of IOStack is to create a Software-defined Storage toolkit for Big Data on top of the OpenStack platform. IOStack will enable efficient execution of virtualized analytics applications over virtualized storage resources thanks to flexible, automated, and low cost data management models based on SDS. In order to achieve this general goal, IOStack focuses on the following objectives:

G-1. Storage and compute disaggregation and virtualization. The objective is to create a virtual model for compute, storage and networking that allows orchestration tools to manage resources in an efficient manner.
G-2. SDS Services for Analytics. The objective is to define, design, and build a stack of SDS data services enabling virtualized analytics with improved performance and usability.
G-3. Orchestration and deployment of Big Data analytics services. The objective is to design and build efficient deployment strategies for virtualized analytic-as-a-service instances.

The IOStack project has fulfilled all of the aforementioned objectives. In particular:
- Regarding G-1, the project achieved high impact in the storage community by presenting a ground-breaking SDS architecture for object and block storage (presented at FAST'17). As a consequence, we produced an SDS toolkit for object and block storage that demonstrated the potential of this technology. We also demonstrated in use cases how SDS technologies can reduce costs and simplify the data management of growing datasets in companies and institutions.
- With regard to G-2, the project has demonstrated, using a clear list of Key Performance Indicators, significant improvements in performance and data throughput when applying SDS services (computation close to the data, data reduction, etc.) to data analytics tasks. We also demonstrated in use cases how our SDS services reduce I/O bottlenecks in data-intensive analysis of massive unstructured data.
- Finally, for objective G-3, the project has pioneered the orchestration and deployment of heterogeneous data-intensive analytics services. We devised advanced scheduling mechanisms, including a flexible heuristic for scheduling analytics applications that aims at high system responsiveness by allocating resources efficiently. We demonstrated in use cases how orchestration technologies can optimize the execution of heterogeneous analytics applications in the cloud.
The IOStack activities carried out during the project have been aligned with the project objectives and directed towards the achievement of the project outcomes. Efforts have been devoted to the analysis, design and development of the IOStack SDS toolkit, a software stack for Big Data analytics on top of OpenStack. The main results produced can be summarized as follows:

- Design and implementation of Konnector: an SDS framework for block storage (OpenStack Cinder) that enables the interception of storage flows from block volumes in order to optimize storage workloads of analytics applications.

- Design and implementation of Crystal: an open, extensible and meta-programmable SDS framework for object storage (OpenStack Swift) that provides simplified policy-based storage management to system administrators.

- Design and implementation of Zoe, a general-purpose cluster manager and scheduler that is able to deploy and schedule analytics applications (clusters) that use a variety of large-scale computing frameworks (e.g., Spark, TensorFlow).

- We have open-sourced and improved OpenStack Storlets as a core component for executing compute tasks on object storage. Furthermore, Storlets are already being used by different companies and organizations around the world.

- Design and implementation of Stocator, a high-performance storage connector that enables Hadoop-based analytics engines to work directly on data stored in object storage systems. It was presented at Spark Summit 2017 and is already in production in IBM Analytics for Apache Spark, an IBM Cloud service.

- Spark data reduction mechanisms: On the one hand, we augmented Spark with the ability to delegate computations to the storage cluster (via Crystal/Storlets). On the other hand, we devised a data reduction technique (Pluggable Spark SQL Filters) that dynamically filters out irrelevant objects during query execution, thus accelerating Spark Big Data analytics on data stored in object stores. Both mechanisms enable use case companies like GridPocket to perform queries over their data much faster.

- Publication of research papers in high-level conferences/journals such as FAST, IMC, ICDE, Middleware, Internet Computing, etc.

- Dissemination of the project in major community events such as OpenStack Summit and Spark Summit and other industrial/commercial events (CloudScape, IBM InterConnect).

All open source software results of IOStack are available at:
IOStack has achieved important milestones that can have a significant impact on the research, industry and open source communities.

First, we built an innovative SDS platform for OpenStack, the most important open source cloud community. IOStack is designed to separate the control and data planes of the system and to implement the concepts of storage policies and filters. Such a design provides flexibility and extensibility, which is key to attracting the open source community. No other SDS product on the market includes the extensible filtering mechanisms provided by IOStack.
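
The separation of control and data planes, together with storage policies and filters, can be illustrated with a minimal Python sketch. This is not the API of Crystal, Konnector or Storlets: the names `PolicyStore`, `DataPlane`, `compress_filter` and `make_grep_filter`, as well as the sample object and tenant, are hypothetical and exist only for illustration. The sketch assumes a filter is a function over the bytes of a request, and that a policy is an ordered list of filters attached to a tenant.

```python
import zlib
from typing import Callable, Dict, List

# Assumption: a filter is any function that transforms the bytes of a request.
Filter = Callable[[bytes], bytes]


def compress_filter(data: bytes) -> bytes:
    """Hypothetical data-reduction filter: compress the payload."""
    return zlib.compress(data)


def make_grep_filter(keyword: bytes) -> Filter:
    """Hypothetical 'computation close to the data' filter: keep matching lines."""
    def grep(data: bytes) -> bytes:
        return b"\n".join(line for line in data.split(b"\n") if keyword in line)
    return grep


class PolicyStore:
    """Control plane: maps a tenant to an ordered list of filters."""

    def __init__(self) -> None:
        self._policies: Dict[str, List[Filter]] = {}

    def set_policy(self, tenant: str, filters: List[Filter]) -> None:
        self._policies[tenant] = list(filters)

    def filters_for(self, tenant: str) -> List[Filter]:
        return self._policies.get(tenant, [])


class DataPlane:
    """Data plane: intercepts reads and applies the filters the policy dictates."""

    def __init__(self, policies: PolicyStore, objects: Dict[str, bytes]) -> None:
        self._policies = policies
        self._objects = objects

    def get(self, tenant: str, name: str) -> bytes:
        data = self._objects[name]
        for apply_filter in self._policies.filters_for(tenant):
            data = apply_filter(data)
        return data


policies = PolicyStore()
store = DataPlane(
    policies,
    {"meters.csv": b"id,city,kwh\n1,Nice,3.2\n2,Paris,4.1\n3,Nice,2.7"},
)

# No policy yet: the tenant receives the full object.
raw = store.get("tenant_a", "meters.csv")

# The administrator changes the policy; the data path itself is untouched.
policies.set_policy("tenant_a", [make_grep_filter(b"Nice")])
filtered = store.get("tenant_a", "meters.csv")
```

In this sketch, changing a tenant's behavior only requires a call to `set_policy` on the control plane, and new filters can be added without modifying `DataPlane`, which is the kind of extensibility the design above aims for.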

Second, thanks to the filtering mechanisms of the SDS platform and other client side components, we created SDS data services that provide significant performance gains with respect to state-of-the-art technologies in terms of data transfer reduction, communication costs reduction and application speedup.

Third, the analytics deployment framework, Zoe, simplifies the deployment and scheduling of heterogeneous analytics applications (Spark, MPI, TensorFlow, etc.) over Docker containers, compared to lower-level frameworks like Mesos or Kubernetes. Zoe is already providing fast and automated analytics deployments to our use case companies and to other companies like KPMG Germany, making their daily management of compute clusters more efficient and time-saving.

In conclusion, the IOStack project offers mechanisms to manage rapidly growing data volumes for companies that are making the digital transition. In particular, the outcomes of the project can be beneficial for Key Enabling Technologies such as energy-efficient buildings or bioinformatics, among others. In such settings, with large unstructured data pools, these advanced cloud data management technologies will represent an advantage for European companies and institutions.