Next gEneration Sequence sTORage

A data series is a sequence of data points ordered over a dimension (time, in the case of time series). Data series are present in virtually every scientific and social domain: from health care, astronomy and biology, to finance and the internet-of-things. It is not unusual for applications in these domains to involve sequences (or subsequences) numbering on the order of billions. As a result, analysts are unable to handle the vast amounts of data series that they have to filter and process.

Common practice in various domains is to use custom data analysis solutions, usually built with higher-level programming languages such as R or Python. Such techniques, while acceptable for small data processing scenarios, are unfit for larger-scale data management and exploration. This is because they run counter to established database research: they take no advantage of indexes, physical data independence, query optimization, or data processing methods designed for scalability.

Existing systems are unsuitable for data series, and as a result are not used, because they have been designed and optimized for fundamentally different data and query models. Current relational storage layers cannot handle the access patterns that analysts are interested in without scanning large amounts of unnecessary data or incurring large processing overhead, which makes complex analytics inefficient. Moreover, relational query languages are not an intuitive interface for expressing data series specific queries. To alleviate these problems and exploit the potential of sequential data analytics, specialized data series storage and retrieval systems have to be developed. Such systems will allow analysts across different fields to efficiently retrieve and manipulate their sequences of interest.

Along these lines, our goal is to develop a specialized data series storage layer that goes beyond relational and array databases by organizing data series in a way that fits specialized data series workloads. Such a system has the potential to optimize data processing pipelines in domains as diverse as biology, IoT data analytics, finance, and personalized medicine.

[Experimental evaluation of the state-of-the-art on similarity search] One of the most important queries that analysts need to perform on data series is that of identifying similar sequences. Most existing systems evaluate such queries in a brute-force way, without utilizing specialized data structures that can significantly improve performance. Our first goal was to understand the state of the art in the domain. To achieve that, we performed an in-depth comparison of the most important similarity search data structures. This was the first in-depth experimental evaluation of such methods performed under a common implementation framework and with very large datasets. This work was presented at the VLDB conference (one of the top venues in the domain of Database Systems) in 2018, and a follow-up work will be presented at the same conference in 2020.
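As a point of reference, the brute-force evaluation mentioned above can be sketched as a linear scan. This is only an illustration: the function names are hypothetical, and Euclidean distance is assumed as the similarity measure, which the report does not specify.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_nn(query, collection):
    """Return (index, distance) of the series closest to `query`.

    Every series in `collection` is compared against the query, so the
    cost grows linearly with the number of series stored.
    """
    best_idx, best_dist = -1, math.inf
    for idx, series in enumerate(collection):
        dist = euclidean(query, series)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist
```

With billions of sequences, such a full scan touches every stored series for each query, which is the cost that specialized index structures aim to avoid.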

[Novel similarity search methods] Since our goal was to develop an end-to-end data series storage system, we investigated how traditional data structures that are used in existing database systems can be re-engineered to perform similarity search. We re-purposed generic 1-dimensional data structures, such as B-Trees and LSM-Trees, to perform fast similarity search. This was achieved through a novel technique that allowed us to compress data using sortable summarizations. Such summarizations enable simple 1-dimensional index structures to store and retrieve multi-dimensional objects through their summaries. This line of work was presented at the ACM SIGMOD 2019 conference, the VLDB 2018 conference, and in the VLDB Journal (VLDBJ).
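The idea of a sortable summarization can be illustrated with a minimal sketch. The concrete scheme below is an assumption made for illustration, not necessarily the one used in the project: each series is reduced to a few segment means, each mean is quantized to a small integer, and the bits of these integers are interleaved into a single key. A sorted Python list stands in for a B-Tree, and all names and parameters (`segments`, `bits`, `lo`, `hi`, `window`) are hypothetical.

```python
import bisect
import math


def paa(series, segments):
    """Mean of each equal-length segment (length must be a multiple of `segments`)."""
    size = len(series) // segments
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(segments)]


def quantize(value, lo, hi, bits):
    """Map a value, clipped to [lo, hi], to an integer in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def sortable_key(series, segments=4, bits=4, lo=-3.0, hi=3.0):
    """Interleave the bits of the quantized segment means into one integer.

    The most significant bit of every segment comes first, so series that
    are adjacent in key order tend to share the same coarse shape.
    """
    symbols = [quantize(m, lo, hi, bits) for m in paa(series, segments)]
    key = 0
    for bit in range(bits - 1, -1, -1):
        for sym in symbols:
            key = (key << 1) | ((sym >> bit) & 1)
    return key


def build_index(collection, **kw):
    """Sorted list of (key, series id); a stand-in for a 1-dimensional index."""
    return sorted((sortable_key(s, **kw), i) for i, s in enumerate(collection))


def approximate_nn(query, collection, index, window=2, **kw):
    """Return (distance, series id), checking only `window` entries on each
    side of the position where the query's key would be inserted."""
    pos = bisect.bisect_left(index, (sortable_key(query, **kw), -1))
    candidates = index[max(0, pos - window):pos + window]
    return min(
        ((math.dist(query, collection[i]), i) for _, i in candidates),
        key=lambda pair: pair[0],
    )
```

Because the key is a plain integer, any ordered index can store it. The search shown here is approximate: it inspects only the neighbors of the query in key order and may miss the true nearest series.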

[Towards a unified data series management system] Our next topic was that of implementing an end-to-end system. Towards this direction, we designed a specialized data model that captures the specific characteristics of data series (positions, values that change over positions, and static metadata). We subsequently mapped this model to a column-oriented physical data layout with a specialized query processing pipeline. We developed specialized query operators, as well as a domain-specific query language. We presented our preliminary work at the North East Database Day in 2018, an important venue that gathers researchers from major north-east US institutions such as MIT, CMU, Brown, Yale, Brandeis, Amherst, Northeastern and Boston University. We are currently performing an experimental evaluation of our system, in which it outperforms both non-specialized relational systems and competing approaches, running at least 4x faster. We are working towards a paper submission, which will be finalized in the following months.
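To make the data model more concrete, the following sketch shows one possible way to represent a series (static metadata plus values indexed by position) and to lay it out column by column. The class and field names are hypothetical and do not reflect the actual schema or query language of the system.

```python
from dataclasses import dataclass, field


@dataclass
class DataSeries:
    """One series: static metadata plus values indexed by position."""
    series_id: int
    metadata: dict  # static, does not vary with position
    values: list    # values[i] is the value at position i


@dataclass
class ColumnStore:
    """Column-oriented layout: one flat column per attribute."""
    series_ids: list = field(default_factory=list)
    positions: list = field(default_factory=list)
    values: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # series_id -> metadata

    def append(self, series):
        """Unroll a series into the three parallel columns."""
        for pos, val in enumerate(series.values):
            self.series_ids.append(series.series_id)
            self.positions.append(pos)
            self.values.append(val)
        self.metadata[series.series_id] = series.metadata

    def select(self, min_value, **meta_equals):
        """Return (series_id, position) pairs whose value is >= min_value and
        whose series metadata matches every key/value in `meta_equals`."""
        return [
            (sid, pos)
            for sid, pos, val in zip(self.series_ids, self.positions, self.values)
            if val >= min_value
            and all(self.metadata[sid].get(k) == v for k, v in meta_equals.items())
        ]
```

For example, `store.select(0.5, sensor="ecg")` would combine a threshold check on values with a check on static metadata, under the assumption that each series carries a `sensor` attribute.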

Similarity search is very important, as it forms the basis of data mining algorithms as diverse as clustering, classification, outlier detection, and frequent pattern mining. One of our goals was to integrate efficient similarity search within our data series management system, and to be able to express and evaluate queries that combine it with various other predicate checks (thresholds, metadata checks, etc.). This allows scientists to express complicated logic in a few simple lines.

Towards this direction, we developed a specialized database management system, called NestorDB, and a query language specifically designed for data series, which also incorporates our novel similarity search optimization methods (discussed earlier). Since our approach allows us to perform fast similarity search using off-the-shelf index structures (commonly found in database systems), we were able to keep the implementation concise: the index structures that we use for simple threshold queries can be re-used for similarity search. This greatly reduces the complexity of maintaining and managing the system. In terms of performance, NestorDB outperforms all competing systems, both specialized and non-specialized. It can additionally monitor the queries issued by the user and re-organize its data layout in order to improve performance.

We are currently in discussions with scientists across various domains about deploying our solution in their data processing pipelines. We expect that they will be able to improve the performance of their query processing tasks, as well as the succinctness of their queries.

Finally, we participated in a program at MIT, funded by NSF, called iCorps. During this program, we explored potential commercialization opportunities for the technologies that we have developed. Specifically, we conducted more than 10 interviews with potential clients in industry and science, and identified their needs and current pipelines. This market study gave us invaluable insights into how our system could be extended in the future. We plan to continue investigating such opportunities.

Periodic Reporting for period 2 - NESTOR (Next gEneration Sequence sTORage)
