Periodic Reporting for period 1 - PREFAIL (REINFORCEMENT LEARNING FOR PREDICTIVE FAILURE-DETECTION AND PROACTIVE DATA MANAGEMENT ON DIGITAL STORAGE SYSTEMS)
Reporting period: 2020-10-01 to 2022-02-28
In storage environments that employ thousands of individual devices, hardware failures are the norm, so maintenance costs are the primary concern of operators. Furthermore, data loss due to device failures can be time-consuming and costly to mitigate, both for cloud data storage businesses and for end-users of all types.
Algolysis Ltd has architected and developed DriveNest (https://www.drivenest.com), a distributed storage device monitoring and failure prediction service available to anyone via the Internet. The goal of this service is to let users monitor their storage devices at the physical layer over the Internet, backed by an algorithmic failure prediction system that notifies them in advance of a potential catastrophic data loss.
On the one hand, the PREFAIL project gave Algolysis the opportunity to recruit an Innovation Associate specializing in Machine Learning to assist in the design and implementation of Machine Learning algorithms that proactively identify soon-to-fail storage devices. On the other hand, PREFAIL enabled an experienced scientist to join the team at Algolysis Ltd for one year and gain additional experience and exposure in the industry.
The specific objectives of the project were: (a) to devise an ML-driven failure prediction engine to be coupled with the DriveNest service of Algolysis Ltd, and (b) to have the Innovation Associate participate, in parallel, in a tailored training program to strengthen his abilities and gain experience in the process of innovation.
In parallel, the Innovation Associate (IA) has been given access to a range of online courses and has followed a tailored training programme, organized by the EU, on a variety of topics related to innovation. Algolysis Ltd has also delivered a hands-on personalized training programme to facilitate the IA’s transition into the company and to improve his grant writing capabilities and his technical skills in software engineering, integration and deployment.
The team has further advanced and experimented with both ensemble methods and hybrid learning techniques. Thorough experimentation has shown that the most promising models are those that offer a low False Alarm Rate (FAR), i.e. that rarely flag healthy devices as failing, together with the highest Failure Detection Rate (FDR) achievable at that FAR, even though the resulting FDR is relatively low. This trade-off was deemed important because such predictive models generate alarms that are accurate when they occur, despite being unable to detect all failures.
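The trade-off between False Alarm Rate and Failure Detection Rate can be illustrated with a minimal sketch. It assumes the commonly used definitions FDR = TP / (TP + FN) and FAR = FP / (FP + TN), with label 1 for a failed device and 0 for a healthy one; the report itself does not state these formulas, and the function names `fdr_far` and `pick_threshold` are hypothetical and do not describe the DriveNest implementation. The sketch selects, among candidate score thresholds, the one with the highest FDR whose FAR stays within a chosen budget.

```python
from typing import Optional, Sequence, Tuple


def fdr_far(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (FDR, FAR) for binary labels: 1 = failed device, 0 = healthy device.

    Assumed definitions (not taken from the report):
      FDR = TP / (TP + FN)   share of failed devices that were flagged
      FAR = FP / (FP + TN)   share of healthy devices that were flagged
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fdr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return fdr, far


def pick_threshold(
    y_true: Sequence[int], scores: Sequence[float], max_far: float
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, FDR, FAR) with the highest FDR subject to FAR <= max_far.

    A device is flagged when its score is >= threshold. Returns None when no
    candidate threshold satisfies the FAR budget.
    """
    best = None
    for thr in sorted(set(scores)):
        y_pred = [1 if s >= thr else 0 for s in scores]
        fdr, far = fdr_far(y_true, y_pred)
        if far <= max_far and (best is None or fdr > best[1]):
            best = (thr, fdr, far)
    return best


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    labels = [1, 1, 1, 0, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.1, 0.05]
    print(pick_threshold(labels, scores, max_far=0.0))
    print(pick_threshold(labels, scores, max_far=0.2))
```

On this toy data, a FAR budget of zero leaves one of the three failures undetected, while allowing some false alarms lets the threshold drop far enough to catch all of them. This mirrors the choice described above: alarms that are trustworthy when raised, at the cost of missing some failures.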
The company anticipates that commercialization of its predictive system and failure prediction pipeline will follow this project, enabling it to launch a service available to anyone, from data centre operators and warehouses to home end-users, as well as third-party developers of application and system monitoring tools. This service can play a key role in sibling products and services, such as proactive backup, data migration and data loss mitigation.