
Scalable Similarity Search

Final Report Summary - SSS (Scalable Similarity Search)

Similarity search is the task of identifying, in a collection of items, the ones that are “similar” to a given query item. This task has a range of important applications (e.g. in information retrieval, pattern recognition, statistics, and machine learning) where data sets are often large, high-dimensional, and possibly noisy. Prior to this project, state-of-the-art methods for similarity search offered only weak guarantees when faced with big data: either the space overhead was excessive (thousands of times larger than the space needed for the data itself), or the work needed to report the similar points could be comparable to the work needed to go through all points. As a result, many applications have resorted to ad-hoc solutions with only weak theoretical guarantees.
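To make the task concrete, the following minimal Python sketch shows the brute-force baseline: computing the distance from the query to every point and reporting the closest ones. The data sizes and the choice of Euclidean distance are illustrative assumptions, not part of the project; the point is that every query touches the entire data set, which is exactly the cost that scalable similarity search aims to avoid.

    import numpy as np

    def linear_scan_nearest(data, query, k=10):
        """Return indices of the k points in `data` closest to `query`.

        Brute-force baseline: the cost grows linearly with the number of
        points and their dimensionality, for every single query.
        """
        dists = np.linalg.norm(data - query, axis=1)  # distance to every point
        return np.argsort(dists)[:k]

    # Illustrative data: 100,000 random 128-dimensional points
    rng = np.random.default_rng(0)
    data = rng.standard_normal((100_000, 128))
    query = rng.standard_normal(128)
    print(linear_scan_nearest(data, query, k=5))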

The project has contributed to strengthening the theoretical foundation of scalable similarity search, and has developed novel, practical similarity search methods backed by theory. We set out to develop new methods for similarity search that are: 1) provably robust, 2) scalable to large and high-dimensional data sets, 3) substantially more resource-efficient than earlier state-of-the-art solutions, and 4) able to provide statistical guarantees on query answers. Achieving these objectives would allow more flexible, reliable, and resource-conscious computing systems in a variety of areas.

The main contributions of the project address the first three aspects. A major finding addressing 1) and 2) is that for many important distance measures it is possible to devise search methods that match the efficiency of existing ones, yet are more robust in that they never miss a search result (no "false negatives"). Patents on one of these methods have been issued in Denmark and the United States. Significant progress has been made in the theoretical understanding of 2).

Another advance, addressing the theoretical understanding of 3), is a new framework generalizing the locality-sensitive hashing (LSH) framework to allow "time-space tradeoffs", where the required space can be scaled to the available memory, a necessary feature for scalability to large data sets. This result strictly improves on the so-called entropy method.
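For readers unfamiliar with LSH, the sketch below illustrates the classical framework that this result builds on, using random-hyperplane hashes for angular similarity. The class and parameter names are our own illustration rather than the project's new framework; the relevant observation is that the number of hash bits per table (k) and the number of tables (L) are exactly the kind of knobs that trade query time against space.

    import numpy as np
    from collections import defaultdict

    class HyperplaneLSH:
        """Classical LSH index for angular similarity (illustrative sketch).

        k -- sign bits concatenated per table; larger k gives fewer, more
             selective collisions, i.e. less work per table at query time.
        L -- number of independent tables; larger L raises the chance that a
             near neighbour collides with the query in at least one table,
             at the cost of roughly L times the space.
        """
        def __init__(self, dim, k=8, L=10, seed=0):
            rng = np.random.default_rng(seed)
            self.planes = [rng.standard_normal((k, dim)) for _ in range(L)]
            self.tables = [defaultdict(list) for _ in range(L)]

        def _key(self, planes, vec):
            return tuple(int(b) for b in (planes @ vec > 0))  # k sign bits

        def insert(self, idx, vec):
            for planes, table in zip(self.planes, self.tables):
                table[self._key(planes, vec)].append(idx)

        def candidates(self, vec):
            out = set()
            for planes, table in zip(self.planes, self.tables):
                out.update(table.get(self._key(planes, vec), []))
            return out  # candidates to be verified against the query

In this classical setting the user has to pick k and L by hand; the framework developed in the project makes the underlying time-space tradeoff explicit and provable, so that the space consumption can be adjusted to the memory that is actually available.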

We would also like to mention a number of new algorithmic techniques that are now finding applications in algorithms for high-dimensional data. These include "chosen-path filters", "LSH pools", "distance-sensitive hashing", and "confirmation sampling".

Finally, the project has developed several prototypes of software based on the theoretical advances. In particular, the PUFFINN library makes it easier to use state-of-the-art LSH methods by not requiring the user to specify parameters that need to be tuned for a given data set. The ANN-Benchmarks framework makes it easier to compare implementations of k-Nearest Neighbor algorithms to the current state of the art.
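As an illustration of what this means for a user, the sketch below shows how a PUFFINN-style index might be built under a memory budget and queried with a recall target instead of hand-tuned LSH parameters. The module, class, and method names (puffinn.Index, insert, rebuild, search) and their signatures are assumptions based on our reading of the library's documentation and should be checked against the current release.

    # Hypothetical usage sketch -- names and signatures are assumptions,
    # not a verified copy of the PUFFINN API.
    import puffinn

    dimensions = 128
    # Build an index for angular similarity under a fixed memory budget (bytes).
    index = puffinn.Index('angular', dimensions, 512 * 1024 * 1024)

    for vec in dataset:        # `dataset`: an iterable of 128-dimensional vectors
        index.insert(vec)
    index.rebuild()            # construct the LSH structures within the budget

    # The caller states a recall target; the index chooses its own parameters.
    neighbours = index.search(query, 10, 0.9)  # 10 nearest neighbours, ~90% recall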

In an effort to ensure visibility and interaction with the broader community, invited talks on similarity search have been given at conferences on database theory (ICDT 2015), string algorithms (CPM 2015), algorithm theory (ESA 2015, MFCS 2017, WADS 2019), similarity search (SISAP 2015), and machine learning/data mining (ECML/PKDD 2016).

Public web site: sss.projects.itu.dk