
REINFORCEMENT LEARNING FOR PREDICTIVE FAILURE-DETECTION AND PROACTIVE DATA MANAGEMENT ON DIGITAL STORAGE SYSTEMS

Periodic Reporting for period 1 - PREFAIL (REINFORCEMENT LEARNING FOR PREDICTIVE FAILURE-DETECTION AND PROACTIVE DATA MANAGEMENT ON DIGITAL STORAGE SYSTEMS)

Reporting period: 2020-10-01 to 2022-02-28

The ever-increasing amount of data generated and consumed by modern and next-generation end-user and business applications necessitates a physical infrastructure expansion of unprecedented magnitude. Despite the development of data storage abstractions for cloud applications, such as distributed file systems, this physical infrastructure, e.g. data warehouses and data centers, still relies on individual storage devices, such as Hard Disk Drives and Solid State Drives.

In these environments, which employ thousands of individual storage devices, hardware failures are the norm. Maintenance costs are therefore a primary concern for operators. Furthermore, data loss due to device failures can be time-consuming and costly to mitigate, both for cloud data storage businesses and for end-users of all types.

Algolysis Ltd has architected and developed DriveNest (https://www.drivenest.com), a distributed storage device monitoring and failure prediction service available to anyone via the Internet. The goal of this service is to let users monitor their storage devices at the physical layer over the Internet, while having at their disposal an algorithmically-backed failure prediction system that notifies them in advance of a potentially catastrophic data loss.

The PREFAIL project gave Algolysis the opportunity to recruit an Innovation Associate specializing in Machine Learning to assist in the design and implementation of Machine Learning algorithms for proactively identifying soon-to-fail storage devices. At the same time, PREFAIL enabled an experienced scientist to join the team at Algolysis Ltd for one year and gain additional experience and exposure in the industry.

The specific objectives of the project were: (a) to devise an ML-driven failure prediction engine to be coupled with the DriveNest service of Algolysis Ltd, and (b) to have the Innovation Associate follow, in parallel, a tailored training programme to strengthen his abilities and gain experience in the process of innovation.

During this project the Innovation Associate (IA) joined the team of Algolysis Ltd and studied the state of the art in failure prediction for storage devices. The IA was provided with a large dataset on which to experiment with, train and validate different Machine Learning models. He studied and extended the company’s non-parametric supervised learning methods, and also trained a series of artificial neural networks and compared them against those methods. All methods have been incorporated into the company's Failure Prediction Engine and interfaced with the DriveNest monitoring platform. The results obtained and the implementations form the basis for the SME to advance this innovation to a Technology Readiness Level (TRL) close to full commercialization.

In parallel, the IA was given access to a range of online courses and followed a tailored training programme, organized by the EU, on a variety of topics related to innovation. Algolysis Ltd also delivered a hands-on, personalized training programme designed to ease the IA’s transition into the company and to strengthen his grant-writing capabilities and his technical skills in software engineering, integration and deployment.

In the PREFAIL project the majority of state-of-the-art techniques for failure prediction have been implemented, and a variety of improvements have been made to tackle shortcomings of previous techniques, such as high levels of noise, class imbalance, concept drift, and sample heterogeneity.

The team has further advanced and experimented with both ensemble methods and hybrid learning techniques. It has become evident from thorough experimentation that the most promising models are those that keep the False Alarm Rate (FAR) low, i.e. rarely flag healthy devices as failing, while achieving the highest possible Failure Detection Rate (FDR), even though that FDR remains relatively low. This was deemed important because alarms raised by such models are trustworthy when they occur, even though the models cannot detect all failures.
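The report does not define the two metrics formally. As a minimal sketch, assuming the usual definitions (FDR as the fraction of actually failed devices that were flagged, FAR as the fraction of healthy devices that were wrongly flagged), they could be computed as follows. The function name `detection_rates` and the sample data are hypothetical and are not taken from the DriveNest implementation.

```python
from typing import Sequence, Tuple


def detection_rates(
    actual_failed: Sequence[bool],
    predicted_failed: Sequence[bool],
) -> Tuple[float, float]:
    """Return (FDR, FAR) for per-device labels and predictions.

    Assumed definitions (not stated in the report):
      FDR = flagged failed devices  / all failed devices
      FAR = flagged healthy devices / all healthy devices
    """
    if len(actual_failed) != len(predicted_failed):
        raise ValueError("label and prediction sequences differ in length")

    pairs = list(zip(actual_failed, predicted_failed))
    true_alarms = sum(1 for actual, pred in pairs if actual and pred)
    false_alarms = sum(1 for actual, pred in pairs if not actual and pred)
    failed = sum(1 for actual in actual_failed if actual)
    healthy = len(actual_failed) - failed

    # Guard against empty classes instead of dividing by zero.
    fdr = true_alarms / failed if failed else 0.0
    far = false_alarms / healthy if healthy else 0.0
    return fdr, far


# Hypothetical example: 10 devices, 2 of which actually failed.
actual = [True, True] + [False] * 8
# The model flags one of the failed devices and no healthy ones.
predicted = [True, False] + [False] * 8

fdr, far = detection_rates(actual, predicted)
print(f"FDR={fdr:.2f} FAR={far:.2f}")
```

In this illustrative case only half of the failures are detected, but no healthy device triggers an alarm, which is the trade-off the paragraph above describes as preferable.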

The company anticipates that commercialization of its predictive system and failure prediction pipeline will follow this project, enabling it to launch a service available to anyone, from data centre and warehouse operators to home end-users, as well as third-party developers of application and system monitoring tools. This service can play a key role in sibling products and services, such as proactive backup, data migration and data loss mitigation.