Community Research and Development Information Service - CORDIS


SSS Report Summary

Project ID: 614331
Funded under: FP7-IDEAS-ERC
Country: Denmark

Mid-Term Report Summary - SSS (Scalable Similarity Search)

Similarity search is the task of identifying, in a collection of items, the ones that are “similar” to a given query item. This task has a range of important applications (e.g. in information retrieval, pattern recognition, statistics, and machine learning) where data sets are often big, high dimensional, and possibly noisy. Prior to this project, state-of-the-art methods for similarity search offered only weak guarantees when faced with big data. Either the space overhead was excessive (thousands of times larger than the space for the data itself), or the work needed to report the similar points could be comparable to the work needed to go through all points. As a result, many applications have resorted to ad hoc solutions with only weak theoretical guarantees.
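To make the task and its cost concrete, the following is a minimal sketch (names and data are illustrative, not from the project) of similarity search in Hamming space via a brute-force linear scan, which is exactly the "go through all points" baseline the paragraph above refers to:

```python
# Illustrative sketch: similarity search in Hamming space by linear scan.
# Given a query, report all items within Hamming distance r.
# The scan touches every item, so the work grows linearly with the
# collection size; this is the scalability problem the project targets.

def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length bit strings packed as ints."""
    return bin(a ^ b).count("1")

def linear_scan(items, query, r):
    """Return all items within Hamming distance r of the query (O(n) work)."""
    return [x for x in items if hamming(x, query) <= r]

data = [0b1010, 0b1011, 0b0000, 0b1111]
print(linear_scan(data, 0b1010, 1))  # reports 0b1010 (distance 0) and 0b1011 (distance 1)
```

Index structures for similarity search aim to answer such queries while inspecting far fewer than all n items.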

The project contributes to strengthening the theoretical foundation of scalable similarity search, and develops novel practical similarity search methods backed by theory. The objective is to produce new methods for similarity search that are: 1) provably robust, 2) scalable to large and high-dimensional data sets, 3) substantially more resource efficient than earlier state-of-the-art solutions, and 4) able to provide statistical guarantees on query answers. Achieving these objectives will enable more flexible, reliable, and resource-conscious computing systems in a variety of areas.

In the first 30 months of the project we have worked primarily on the first three aspects. A major finding addressing 1) and 2) is that for high-dimensional Hamming space it is possible to devise search methods that match the efficiency of existing methods, yet are more robust in that they never miss a search result (no "false negatives"). The host institution has filed patent applications on the underlying method. Also, we have made a notable advance towards practical constructions of so-called expander graphs, mathematical objects that in many cases can transform randomized algorithms with a certain failure probability into deterministic ones that never fail. We expect this to impact several areas within the project.

Significant progress has been made in the theoretical understanding of 2). In the case of similarity joins, where a large number of similarity searches are processed in batch, we have described an algorithmic approach that makes significantly better use of the memory hierarchy than existing methods. This also means that scalability to parallel and distributed systems is within reach.
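For readers unfamiliar with why "no false negatives" is a nontrivial guarantee, the following is a sketch of classical bit-sampling locality-sensitive hashing for Hamming space. This is the standard randomized approach, not the project's patented method; all parameters and names are illustrative. Its failure mode, a true neighbor that lands in a different bucket in every table and is therefore missed, is precisely the false negative that the project's result provably avoids:

```python
# Illustrative sketch of classical bit-sampling LSH for Hamming space.
# Similar points agree on most bits, so they are likely to agree on a
# small random sample of bit positions and land in the same bucket.
# A true neighbor can still hash differently in every table: a false negative.
import random

DIM = 16  # bit length of items (assumption for this sketch)

def make_hash(k, dim=DIM, rng=random):
    """Sample k random bit positions; close points likely agree on all of them."""
    positions = [rng.randrange(dim) for _ in range(k)]
    def h(x):
        return tuple((x >> p) & 1 for p in positions)
    return h

def build_index(items, num_tables=4, k=6, seed=0):
    """Build several hash tables, each bucketing items by one sampled hash."""
    rng = random.Random(seed)
    tables = []
    for _ in range(num_tables):
        h = make_hash(k, rng=rng)
        buckets = {}
        for x in items:
            buckets.setdefault(h(x), []).append(x)
        tables.append((h, buckets))
    return tables

def query(tables, q, r):
    """Collect candidates from the query's bucket in each table, then verify.
    Only candidates are checked, so the work can be far below a full scan,
    but a neighbor absent from all probed buckets is silently missed."""
    candidates = set()
    for h, buckets in tables:
        candidates.update(buckets.get(h(q), []))
    return [x for x in candidates if bin(x ^ q).count("1") <= r]
```

Increasing the number of tables reduces the miss probability but never drives it to zero, which is what makes a guarantee of no false negatives at comparable efficiency notable.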

Another significant advance that we made in the theoretical understanding of 3) is a new framework, generalizing the LSH framework to allow time-space tradeoffs, a necessary feature for scalability to large data sets. This result strictly improves the so-called entropy method. Concurrently with our efforts, significant advances have been made by other groups using so-called data-dependent locality-sensitive hashing. This further increases the likelihood that the project objectives can be met, though many challenges remain.
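For context on what is being generalized: the classical LSH framework fixes a single point on the time-space curve. With a hash family in which similar points collide with probability p1 and dissimilar ones with probability p2 < p1, the standard bounds (a textbook fact, not a result of this project) are

```latex
T_{\text{query}} = O(n^{\rho}), \qquad
S = O(n^{1+\rho}), \qquad
\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}
```

A time-space tradeoff framework instead exposes a curve of achievable (time, space) pairs, so that, for example, query time can be reduced further at the cost of additional space, or space brought close to linear at the cost of slower queries.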

In an effort to ensure visibility and interaction with the broader community, invited talks on similarity search have been given at conferences on database theory (ICDT 2015), string algorithms (CPM 2015), algorithms theory (ESA 2015), similarity search (SISAP 2015), and machine learning/data mining (ECML PKDD 2016).

Reported by

IT University of Copenhagen