ScaleWorks – Parallel File System Support for Rapid Prototyping of Scalable Applications

Final Report Summary - SCALEWORKS (Parallel File System Support for Rapid Prototyping of Scalable Applications)

- Overview of main results

Result #1: SCALEWORKS achieved rapid prototyping of a scalable, highly-available, and efficient email service via synthesis of scripted application logic and an application-specific parallel file system based on a reusable eventually-consistent replication primitive.

The SCALEWORKS project explored the feasibility and utility of novel infrastructural services that provide high-level abstractions or data structures as the fundamental storage infrastructure. SCALEWORKS demonstrated that a key piece missing from past proposals such as Boxwood and scalable distributed data structures (SDDS) is a set of primitives that explore the space of data consistency semantics. Specifically, it showed that a reusable eventually-consistent replication primitive is one such missing piece, bringing us closer to the vision of a toolbox of universal abstractions to support arbitrary scalable application development.

SCALEWORKS research (presented in detail in the paper by Koromilas and Magoutis at the DAIS'11 conference) considered the following questions: (a) Does a reusable eventually-consistent replication primitive (Cassandra) significantly reduce the development effort of a scalable, highly-available e-mail server compared to the effort expended in a past project that aimed to build a similar system? (b) Does the resulting system exhibit the scalability and availability expected of a robust scalable e-mail service?

SCALEWORKS made the following contributions in this area: (1) Designed and implemented a fully functioning scalable, highly-available e-mail service based on synthesis of interoperable components (extensible high-level development interfaces, Cassandra storage system). (2) Demonstrated the speed of prototyping that such a software engineering approach allows. Specifically, it took us a few tens of lines of Python code to implement a working prototype compared to the 41,000 lines of C++ code it took for a system with similar design principles a decade ago. (3) Evaluated the configuration and tuning of the underlying storage engine for the targeted application, exhibiting the scalability, availability, and manageability properties of our rapidly-prototyped system.
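
To make the idea of an e-mail store layered on an eventually-consistent replication primitive more concrete, the following is a minimal in-memory sketch. It is not the project's code and does not use the Cassandra API; the names `Replica`, `EventualStore`, `deliver`, and `list_mailbox` are hypothetical, and last-write-wins with a simple anti-entropy pass is an assumed conflict-resolution scheme chosen only for illustration.

```python
import itertools


class Replica:
    """One in-memory replica mapping key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, ts, value):
        # Last-write-wins: keep only the version with the larger timestamp.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)


class EventualStore:
    """Hypothetical eventually-consistent store over several replicas."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self._clock = itertools.count(1)  # stand-in for write timestamps

    def put(self, key, value, reachable=None):
        # A write lands only on the replicas that are currently reachable.
        ts = next(self._clock)
        targets = range(len(self.replicas)) if reachable is None else reachable
        for i in targets:
            self.replicas[i].apply(key, ts, value)

    def get(self, key, replica=0):
        entry = self.replicas[replica].data.get(key)
        return None if entry is None else entry[1]

    def anti_entropy(self):
        # Exchange every known version so all replicas converge.
        versions = [(k, ts, v)
                    for r in self.replicas
                    for k, (ts, v) in r.data.items()]
        for r in self.replicas:
            for k, ts, v in versions:
                r.apply(k, ts, v)


def deliver(store, user, msg_id, body, reachable=None):
    # One key per message, so concurrent deliveries never overwrite each other.
    store.put((user, msg_id), body, reachable)


def list_mailbox(store, user, replica=0):
    # List the message ids that a given replica currently knows for this user.
    data = store.replicas[replica].data
    return sorted(mid for (u, mid) in data if u == user)
```

In this sketch, a message delivered while one replica is unreachable is temporarily invisible when the mailbox is listed from that replica, and becomes visible after `anti_entropy()` runs. That is the availability-over-consistency trade-off an eventually-consistent primitive makes.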

Result #2: SCALEWORKS developed a novel reusable component for resource discovery and management, deployed on top of the Hadoop file system (HDFS), that uses a combination of approaches to understand dependencies in underlying Cloud infrastructures and exposes them to applications via a query interface. Through this component, applications can acquire knowledge of several properties of the underlying resources, especially in a Cloud deployment scenario where virtualisation purposely hides the relationship between virtual resources and the underlying physical infrastructure. Applications can thus adapt and better perform policies such as task/data assignment and migration.

SCALEWORKS research (presented in detail in the paper by Kitsos, Papaioannou, Tsikoudis, and Magoutis at the NOMS'12 conference) contributed: (a) two methodologies to perform distributed discovery of virtual machine (VM) collocation information in a variety of Cloud environments, both implemented by appropriately extending HDFS and readily available to applications via a query interface; (b) a novel, lightweight, privacy-preserving, easy-to-deploy Cloud management API that results in minimal overall discovery time; and (c) an evaluation of the methodologies on three commercial Cloud providers (Flexiant, Rackspace, Amazon), as well as on our own experimental Eucalyptus setup, providing insight into their VM allocation characteristics.

Result #3: SCALEWORKS developed a new approach for rapidly improving the robustness of an existing scalable file system (HDFS), by designing and implementing a strongly-consistent replication primitive as a re-targetable building block that can then be retrofitted with minimal engineering effort into the scalable file system.

SCALEWORKS research (presented in detail in the MSc thesis of Giorgos Saloustros, submitted to the computer science department, University of Crete) developed a novel replication primitive (ZKRL) by extending a general-purpose open-source coordination service (Apache ZooKeeper) to maintain consistent data replicas across read and append operations. By cleanly replacing the ad-hoc replication protocol of the Hadoop scalable file system (HDFS), SCALEWORKS achieves stronger overall data replication semantics. The resulting system thus becomes a candidate for a wider range of applications than previously possible. This work further highlights that this improvement need not incur a performance penalty: our results showed that we can achieve performance close to peak I/O (network or disk) capabilities when appending to replicated logs over 1-Gb/s and 10-Gb/s networks.
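
The general idea of keeping log replicas consistent across read and append operations can be illustrated with a small in-memory sketch. This is not the ZKRL protocol and does not use any ZooKeeper API; it assumes a single coordinator that commits an append only after a majority of replicas store it in order, and the names `LogReplica`, `ReplicatedLog`, and `catch_up` are hypothetical.

```python
class LogReplica:
    """One replica of the log; `up` simulates whether it is reachable."""

    def __init__(self):
        self.entries = []
        self.up = True

    def store(self, index, record):
        # Accept only the next in-order entry, so replicas never diverge.
        if not self.up or index != len(self.entries):
            return False
        self.entries.append(record)
        return True


class ReplicatedLog:
    """Hypothetical coordinator: an append commits only with a majority."""

    def __init__(self, n_replicas=3):
        self.replicas = [LogReplica() for _ in range(n_replicas)]
        self.committed = 0  # number of committed entries

    def append(self, record):
        index = self.committed
        acks = sum(r.store(index, record) for r in self.replicas)
        if acks * 2 > len(self.replicas):
            self.committed += 1
            return index
        # No majority: discard the uncommitted entry everywhere.
        for r in self.replicas:
            del r.entries[self.committed:]
        raise RuntimeError("append failed: no majority of replicas")

    def read(self):
        # Any reachable, up-to-date replica returns the same committed prefix.
        for r in self.replicas:
            if r.up and len(r.entries) >= self.committed:
                return list(r.entries[:self.committed])
        raise RuntimeError("no up-to-date replica reachable")

    def catch_up(self, i):
        # Bring a recovered replica back by copying the committed prefix.
        prefix = self.read()
        self.replicas[i].entries = prefix
        self.replicas[i].up = True
```

Readers in this sketch only ever observe entries that a majority has stored, which is the property that distinguishes a strongly-consistent primitive from the eventually-consistent one discussed under Result #1.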

Result #4: SCALEWORKS developed a new mechanism to achieve fault-tolerance in a scalable stream processing system via use of a general-purpose parallel file system (PVFS).

SCALEWORKS research (presented in detail in the papers by Sebepou and Magoutis, DSN'11 conference, and Stamatakis, Grammenos, and Magoutis, AmI'11 conference) created a novel fault-tolerance mechanism (Continuous Eventual Checkpointing, or CEC) that builds on parallel file system support (in our case study, PVFS) to checkpoint operator state, and thus to roll back and then replay logs, achieving recovery and high availability in the face of server failures. Synthesis of this new PVFS-supported reusable component with an existing streaming middleware system (through extension of the system's operator set, without any core system modifications) results in rapid prototyping of fault-tolerant data streaming applications, an otherwise complex undertaking that can take significant engineering effort to prototype and test.
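
The recovery pattern described above (checkpoint operator state to shared storage, then roll back to the checkpoint and replay logged input) can be sketched as follows. This is a plain checkpoint-and-replay illustration, not the CEC mechanism itself; a local temporary directory stands in for the parallel file system, and `CountingOperator` and `run_demo` are hypothetical names.

```python
import json
import os
import tempfile


class CountingOperator:
    """Hypothetical stateful stream operator that counts tuples per key."""

    def __init__(self, directory):
        self.ckpt_path = os.path.join(directory, "checkpoint.json")
        self.log_path = os.path.join(directory, "input.log")
        self.counts = {}
        self.processed = 0  # number of input tuples reflected in `counts`

    def process(self, key):
        # Log the input durably before applying it to the operator state.
        with open(self.log_path, "a") as log:
            log.write(key + "\n")
        self._apply(key)

    def _apply(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.processed += 1

    def checkpoint(self):
        # Write to a temporary file, then rename, so that a crash never
        # leaves a half-written checkpoint behind.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"counts": self.counts, "processed": self.processed}, f)
        os.replace(tmp, self.ckpt_path)

    @classmethod
    def recover(cls, directory):
        op = cls(directory)
        # Roll back to the last checkpoint, if one exists.
        if os.path.exists(op.ckpt_path):
            with open(op.ckpt_path) as f:
                state = json.load(f)
            op.counts, op.processed = state["counts"], state["processed"]
        # Replay the logged tuples that arrived after the checkpoint.
        if os.path.exists(op.log_path):
            with open(op.log_path) as f:
                keys = f.read().splitlines()
            for key in keys[op.processed:]:
                op._apply(key)
        return op


def run_demo():
    directory = tempfile.mkdtemp()
    op = CountingOperator(directory)
    for key in ["a", "b", "a"]:
        op.process(key)
    op.checkpoint()
    op.process("b")
    op.process("c")
    # Simulate a server failure: discard `op` and rebuild from storage.
    return CountingOperator.recover(directory)
```

In `run_demo`, the recovered operator reflects all five tuples even though only three were covered by the checkpoint; the remaining two are reconstructed from the input log.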

- Conclusions

The SCALEWORKS project demonstrated a set of methodologies for designing and rapidly prototyping distributed applications by synthesising reusable system components on top of extensible parallel file system technologies, backed by system optimisations and high-speed networking techniques. The SCALEWORKS project showed that it is possible to achieve rapid prototyping of scalable applications without sacrificing efficiency.

The SCALEWORKS ecosystem (shown in the attached figure) currently comprises a number of reusable distributed systems components (strongly-consistent replication, eventually-consistent replication, highly available metadata, scalable logging, scalable locking, and resource discovery) that can be combined with minimal engineering effort into applications from diverse domains such as continuous data streaming, e-mail services, and parallel databases. SCALEWORKS used three parallel file systems (PVFS, HDFS, and an application-specific scalable file system) as carrier platforms on which these reusable components can be incubated and through which they can become available to applications. As envisioned in the SCALEWORKS proposal, this proved to be a practical approach whose success is based on the ubiquity of the file system application programmer's interface and on the fact that parallel file systems already provide basic support for a variety of scalability issues, which need not be implemented from scratch.

SCALEWORKS covered a very large portion of the research space described in its original proposal in terms of both number and type of reusable components, parallel file systems, and applications. However, further work remains to be done to turn this research into a service or a product that has wide market appeal. One issue, which is the subject of ongoing research work, is fully automating the component synthesis to prototype applications to a desired specification. The current approach is manual and thus requires the involvement of engineers who understand the design of applications as well as the semantics of the reusable components available. Furthermore, refining the approach to require no 'glue code' at all (a varying but small amount of such code was necessary to synthesise components in all our case studies) would further reduce the effort SCALEWORKS requires from the engineering team it supports and further improve prototyping time. Overall, we believe that the SCALEWORKS project successfully demonstrated its scientific objectives and justified the need for continued exploitation of the developed technologies.

- Socio-economic impact

The need for rapidly designing, developing, and prototyping scalable applications has broad scientific and socio-economic ramifications. Our e-business driven economy demands the lowering of the financial bar and associated risk that stand between solid business ideas and their evolution into successful enterprises. Although we have seen evidence of business ventures that successfully developed scalable applications able to grow as fast as their business, such efforts required large initial investments and long development times. Furthermore, enterprises require system support to reap the benefits of investment in cost-effective multi-core server and high-speed Ethernet infrastructure in mission-critical applications such as enterprise data mining. Finally, science and engineering programs must ensure that computer scientists are trained to cope with emerging scalable network storage challenges.

SCALEWORKS rose to the aforementioned challenges, addressing them through its support for rapid prototyping of scalable applications and system support for cost-efficiency. SCALEWORKS showcased a methodology that can lower the financial bar and associated risk for all enterprises, but particularly for startups and small and medium enterprises (SMEs), where efficiency and time-to-market are critical factors for success. The SCALEWORKS methodology of rapidly prototyping scalable applications gives such enterprises a competitive advantage by allowing them to enter a market sooner while producing distributed software at a lower cost. Additionally, education and mentoring activities undertaken during the project trained about 100 computer scientists on core SCALEWORKS topics as well as related research areas. Standardised course material is now freely available for reuse in other university courses or seminars.

Throughout its duration, SCALEWORKS developed a clear and actionable plan for the use and dissemination of the project results. Efforts during the project involved the publication of four scientific papers at top international conferences, the creation of a Web site describing the project and its objectives and providing links to the aforementioned publications, and networking activities with leading European researchers. We expect that the stream of SCALEWORKS research results will endure, further deepening our understanding of this scientific area and bringing the benefits of those results closer to the European business and academic communities.