High-Performance Indexing for Emerging GPU-Coupled Databases

An unflagging trend over the past decade, and probably the next, is the cheap availability of more and more random access memory (RAM). This is a pivotal difference from even a decade ago, as it enables the computation *in memory* of expansive datasets, even on the scale of terabytes. Moreover, when computation can be done in memory, there are many more opportunities to exploit diverse processors concurrently to accelerate computation; suddenly, one can leverage highly parallel accelerators such as an Intel Xeon Phi or general-purpose graphics processing units (GPGPUs), alongside the inherent parallelism already present in any modern (i.e. super-scalar, vector-register-equipped, multi-core) computer.

There are (at least) two *very* good reasons to target all of these opportunities for parallelism at once:
(a) targeting diverse parallel devices produces more generic algorithms, data structures, and principles which can apply even to as-yet-uninvented parallel devices; and
(b) one can use the *entire* computer to accelerate computation, rather than just heavily optimising one component while the others idle.

At the same time, the ubiquity of mobile devices has generated an explosion of spatio-temporally-annotated textual content, such as geo-located tweets, "check-ins" at local establishments, or news articles with spatial context. Supporting complex analytics over so much complex, heterogeneous data requires sophisticated data structures to help filter information *and* the use of parallelism to manage scale.

Thus this project. The overall objective is to design spatio-temporal-textual indexes that are simultaneously architecture-conscious, i.e. can *fully* exploit the underlying hardware for maximal throughput, and suitable for multiple data-parallel platforms, i.e. can be ported, transferred, and simultaneously used by multiple components of the compute ecosystem. Such data structures would achieve the advantages of targeting multiple parallel devices and help manage the explosive growth in spatio-temporal-textual content.

By the time peer review completes, this project will have produced over half a dozen research papers in just two years. Below, we summarise in chronological order the two most important results from each year.

# Vectorised phrases
A challenge with analysing text is that words alone often do not provide sufficient context. For example, the word "mount" could refer to boarding a horse or any number of mountains in the English-speaking world. However, most text-processing applications simply convert every word into a unique numeric representation, and the context is lost. We devised a method for encoding variable-length phrases into just 128 bits (i.e. the length of a _narrow_ vector register) and an algorithm to mine the variable-length phrases with just a short sequence of instruction-parallel operations, using the vector registers to process up to four possible phrases simultaneously.
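
The report does not spell out the encoding itself, so the following is only a minimal sketch of one layout that fits in 128 bits. It assumes (hypothetically) that every word has a non-zero 32-bit identifier, that a phrase holds at most four words, and that unused lanes are zero. A Python integer stands in for the vector register, and the names `pack_phrase`, `unpack_phrase`, and `candidate_phrases` are illustrative rather than taken from the project.

```python
LANE_BITS = 32                      # assumed width of one word identifier
LANES = 4                           # 4 lanes x 32 bits = 128 bits
LANE_MASK = (1 << LANE_BITS) - 1


def pack_phrase(word_ids):
    """Pack 1 to 4 non-zero word identifiers into one 128-bit integer."""
    if not 1 <= len(word_ids) <= LANES:
        raise ValueError("a phrase must hold between 1 and 4 words")
    packed = 0
    for lane, word_id in enumerate(word_ids):
        if not 0 < word_id <= LANE_MASK:
            raise ValueError("word identifiers must be non-zero and fit in 32 bits")
        packed |= word_id << (lane * LANE_BITS)
    return packed


def unpack_phrase(packed):
    """Recover the word identifiers; a zero lane marks the end of the phrase."""
    word_ids = []
    for lane in range(LANES):
        word_id = (packed >> (lane * LANE_BITS)) & LANE_MASK
        if word_id == 0:
            break
        word_ids.append(word_id)
    return word_ids


def candidate_phrases(word_ids, start):
    """Encode every phrase of length 1 to 4 that begins at position `start`."""
    window = word_ids[start:start + LANES]
    return [pack_phrase(window[:n]) for n in range(1, len(window) + 1)]
```

Under these assumptions, two phrases are equal exactly when their packed values are equal, so a phrase can be compared or hashed as a single fixed-width value regardless of how many words it contains.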

# Clipped bounding boxes
In collaboration with Human Brain Project researchers in Switzerland, we devised a method to accelerate classic spatial search. In general, objects are represented with the smallest (hyper-)rectangle that encloses them, because of the simplicity and efficiency of this representation. However, by focusing only on the corners of these _bounding boxes_, one can "clip away" excess space with minimal representation overhead. This tighter approximation of the underlying content improves query performance, because it prevents falsely concluding that an object lies in the empty corners of the bounding boxes. As a fundamental improvement to basic spatial search, this implies immediate improvements to spatio-temporal-textual search, too.
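
To make the pruning idea concrete, here is a small two-dimensional sketch. It is not the project's data structure: the `Rect`, `Clip`, and `ClippedBox` types are hypothetical, and the sketch assumes that each clip point marks a corner region of the bounding box that is known to contain no object. A query is pruned when the part of it that falls inside the box lies strictly within one such empty corner.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)


@dataclass(frozen=True)
class Clip:
    """Marks the region between one box corner and the point (x, y) as empty."""
    high_x: bool   # True if the clipped corner is on the xmax side
    high_y: bool   # True if the clipped corner is on the ymax side
    x: float
    y: float


@dataclass(frozen=True)
class ClippedBox:
    box: Rect
    clips: Tuple[Clip, ...] = ()

    def may_intersect(self, query: Rect) -> bool:
        """False only when the query certainly touches no enclosed object."""
        if not self.box.intersects(query):
            return False
        # The part of the query that lies inside the bounding box.
        lo_x, hi_x = max(self.box.xmin, query.xmin), min(self.box.xmax, query.xmax)
        lo_y, hi_y = max(self.box.ymin, query.ymin), min(self.box.ymax, query.ymax)
        for clip in self.clips:
            inside_x = lo_x > clip.x if clip.high_x else hi_x < clip.x
            inside_y = lo_y > clip.y if clip.high_y else hi_y < clip.y
            if inside_x and inside_y:
                return False   # the overlap falls entirely in an empty corner
        return True
```

For example, with the box `Rect(0, 0, 10, 10)` and a single clip `Clip(True, True, 6, 6)`, the query `Rect(7, 7, 9, 9)` intersects the plain bounding box but is rejected by `may_intersect`, which is the false positive that clipping is meant to avoid.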

# Vectorised trees
The R-tree (along with its well-known variants) is a classic data structure for multi-dimensional spatial search. It groups together nearby objects and represents the entire group with their approximating (clippable!) bounding box. Classically, at query time, each element of these groups is processed one-by-one; however, with "padding" it is possible to process up to 8 bounding boxes simultaneously by employing auto-vectorisation on machines with 512-bit vector (a.k.a. SIMD) registers. We showed through extensive experiments that, irrespective of the variant of R-tree chosen, significant query time improvements can be made in this fashion.
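
The sketch below illustrates the padding idea only; it is not the project's implementation, and Python itself performs no vectorisation. It assumes 64-bit coordinates, so that a 512-bit register holds 8 of them, and it stores each coordinate of a node's boxes in its own array padded to exactly 8 entries. The loop in `intersect_mask` has a fixed trip count and no early exit, which is the shape of loop a compiler could auto-vectorise in a compiled language. The class name `PaddedNode` and the choice of NaN as padding value are assumptions made for this example.

```python
import math
from typing import Sequence, Tuple

LANES = 8   # assumed: 512-bit register / 64-bit coordinate

Box = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


class PaddedNode:
    """One tree node with its boxes stored coordinate-by-coordinate."""

    def __init__(self, boxes: Sequence[Box]):
        if len(boxes) > LANES:
            raise ValueError("a node holds at most 8 boxes in this sketch")
        pad = [math.nan] * (LANES - len(boxes))
        # Every comparison against NaN is false, so padded lanes never match.
        self.xmin = [b[0] for b in boxes] + pad
        self.ymin = [b[1] for b in boxes] + pad
        self.xmax = [b[2] for b in boxes] + pad
        self.ymax = [b[3] for b in boxes] + pad

    def intersect_mask(self, query: Box) -> int:
        """Return a bitmask with bit i set if box i intersects the query."""
        qxmin, qymin, qxmax, qymax = query
        mask = 0
        for i in range(LANES):   # always 8 iterations, no data-dependent exit
            hit = ((self.xmin[i] <= qxmax) & (qxmin <= self.xmax[i]) &
                   (self.ymin[i] <= qymax) & (qymin <= self.ymax[i]))
            mask |= int(hit) << i
        return mask
```

Padding costs a few unused slots per node, but it removes the need to handle partially filled nodes as a special case: every node is tested with the same eight-wide comparison.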

# Decaying temporality
It is an eternal question whether time is fundamentally different from space. Yet most queries treat space as more general: it is enough to find the nearest spatial objects, but temporal boundaries are taken as fixed and rigid. We demonstrated that treating time as a "decaying" concept, in which something new is more interesting but something old isn't "out of boundary", can identify more interesting textual concepts from a corpus of text than treating time as a (clippable!) bounding box.
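
The report does not state which decay function was used. As one illustration of the idea, the sketch below weights each document by an exponential decay with a configurable half-life (an assumption made for this example) and sums those weights per term, so that old documents contribute less without ever being cut off. The function name `decayed_term_scores` and its parameters are hypothetical.

```python
from collections import defaultdict


def decayed_term_scores(docs, now, half_life):
    """Score terms by recency-weighted document frequency.

    docs      -- iterable of (timestamp, list_of_terms) pairs
    now       -- the query time, in the same unit as the timestamps
    half_life -- age at which a document's weight has halved (assumed model)
    """
    scores = defaultdict(float)
    for timestamp, terms in docs:
        age = max(0.0, now - timestamp)
        weight = 0.5 ** (age / half_life)   # 1.0 when new, never exactly 0
        for term in set(terms):             # count each term once per document
            scores[term] += weight
    return dict(scores)
```

With a half-life of 10 time units, a document that is 10 units old counts half as much as a brand-new one, and a document that is 100 units old still contributes a small positive weight instead of falling outside a fixed time window.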

As most of the results in this project have not yet completed the constructive peer review process at the conclusion of the project's two-year scope, much of the impact is yet to be realised. However, given the significance of the results, we anticipate the following within the forthcoming years:

* The use of vector (SIMD) registers to accelerate in-memory spatial data structures, such as R-trees, will become ubiquitous. Many hardware manufacturers, such as Intel, are steadily increasing the width of their vector registers, so over time, these vectorisation results will provide a repeated doubling of in-memory spatial query performance even without additional algorithmic innovation.

* Continuous patterns of text will be replaced with _wider_ representations that can optionally provide additional context without additional cost, thanks to the employment of vectorisation. As a result, there will be far more flexibility in text analytics to leverage context *when it helps*.

* The quest for multi-device, heterogeneous data structures is nearer to completion. Innovative engineers can integrate many of the results in the project to provide vectorised solutions for both spatial and textual data. Moreover, as SIMD is a more constrained platform than GPGPUs, many of these ideas should port to the latter.

Periodic Reporting for period 1 - CoupledDB (High-Performance Indexing for Emerging GPU-Coupled Databases)
