HeisenData - Towards a Next-Generation Uncertain-Data Management System

Final Report Summary - HEISENDATA (HeisenData - Towards a Next-Generation Uncertain-Data Management System)

Several real-world applications need to manage and reason about large amounts of data that are inherently uncertain. For instance, pervasive computing applications must constantly reason about volumes of noisy sensory readings, e.g. for motion prediction and human behavior modeling; information-extraction tools can assign different possible labels with varying degrees of confidence to segments of text, due to the uncertainties and noise present in free-text data. Such probabilistic data analyses require sophisticated machine-learning tools that can effectively model the complex correlation patterns present in real-life data. Unfortunately, to date, approaches to Probabilistic Database Systems (PDBSs) have relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures: probabilities are typically associated with individual data tuples, with little or no support for capturing data correlations. This research proposal aims to design and build novel PDBS architectures that support a broad class of statistical models and probabilistic-reasoning tools as first-class system objects, alongside a traditional relational-table store. Our proposed architectures employ statistical models to effectively encode data-correlation patterns, and promote probabilistic inference as part of the standard database operator repertoire to support efficient and sound query processing. This tight coupling of relational databases and statistical models represents a major departure from conventional database systems, and many of the core system components need to be revisited and fundamentally re-thought. The proposed research will attack several of the key challenges arising in this novel PDBS paradigm (including query processing, query optimization, data summarization, extensibility, and model learning and evolution), build usable prototypes, and investigate key application domains (e.g. information extraction).

Building on our earlier work and results on probabilistic data management [7,8], our research in the HeisenData project has revolved around a number of different axes, including: (1) New probabilistic-data synopses for effective query optimization and processing in PDBSs; (2) Novel PDBS algorithms and architectures integrating state-of-the-art statistical models for Information Extraction (IE) and Entity Resolution (ER); and, (3) Scalable algorithms and tools for large-scale statistical inference and lineage processing in PDBSs. More specifically, our key results over this first period of the project are as follows. (Copies of relevant publications can be downloaded from the HeisenData project web site, at: http://heisendata.softnet.tuc.gr/ )

Effective Synopses for Probabilistic Data.
Data-reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial: similar to their counterparts for deterministic relations, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in PDBSs. We have introduced definitions and algorithms for building histogram- and Haar wavelet-based synopses on probabilistic data [5]. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we have devised efficient algorithms that construct optimal or near-optimal histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. More recent work [6,12] also introduces probabilistic histograms, which retain the key possible-worlds semantics of probabilistic data, allowing for a more accurate, yet concise, representation of the uncertainty characteristics of data and query results. We have presented a variety of novel techniques for building both optimal and near-optimal probabilistic histograms, each one tuned to a different choice of approximation-error metric. Our experimental studies show that our synopses can accurately capture the key statistical properties of uncertain data, while being much more compact to store and work with than the original uncertain relations.
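
To make the bucket-boundary problem concrete, the following is a minimal Python sketch, not the algorithms of [5]. It assumes a simplified input in which every domain item carries its own independent frequency distribution (the hypothetical `item_pdfs` argument), and it fixes the error metric to expected sum-squared error. Under these assumptions, the expected error of a bucket with the best single representative equals the sum of the second moments minus the squared sum of the means divided by the bucket width, so the classic dynamic program for deterministic histograms can be applied to prefix sums of the first and second moments.

```python
from itertools import accumulate


def expected_sse_histogram(item_pdfs, num_buckets):
    """Pick bucket boundaries minimizing expected sum-squared error.

    item_pdfs: one dict {frequency: probability} per domain item
               (hypothetical input format, assumed independent per item).
    Returns (total expected error, [(start, end, representative), ...]),
    where each bucket covers items start..end-1.
    """
    n = len(item_pdfs)
    if not 1 <= num_buckets <= n:
        raise ValueError("num_buckets must be between 1 and the domain size")

    # First and second moments of each item's frequency.
    means = [sum(v * p for v, p in pdf.items()) for pdf in item_pdfs]
    seconds = [sum(v * v * p for v, p in pdf.items()) for pdf in item_pdfs]
    pre1 = [0.0] + list(accumulate(means))
    pre2 = [0.0] + list(accumulate(seconds))

    def cost(i, j):
        # Expected SSE of one bucket over items i..j-1 when the
        # representative is the mean of the expected frequencies.
        s = pre1[j] - pre1[i]
        return (pre2[j] - pre2[i]) - s * s / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = best[b - 1][i] + cost(i, j)
                if c < best[b][j]:
                    best[b][j], cut[b][j] = c, i

    # Walk the stored cut points backwards to recover the buckets.
    buckets, j = [], n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        buckets.append((i, j, (pre1[j] - pre1[i]) / (j - i)))
        j = i
    buckets.reverse()
    return best[num_buckets][n], buckets


# Four items; only the first one is uncertain (frequency 0 or 2).
example_pdfs = [{0: 0.5, 2: 0.5}, {1: 1.0}, {10: 1.0}, {10: 1.0}]
example_error, example_buckets = expected_sse_histogram(example_pdfs, 2)
```

In this toy input the residual error comes entirely from the variance of the first item, which no choice of boundaries can remove. Other error metrics generally need different bucket-cost functions, which this sketch does not cover.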

PDBS Support for Model-based Information Extraction (IE) and Entity Resolution (ER).
Unstructured text represents a large fraction of the world’s data. It often contains snippets of structured information (e.g. people’s names and zip codes). IE techniques identify such structured information and label entities in blocks of text; the resulting entities are then imported into a standard database and processed using relational queries. This two-part approach, however, suffers from two main drawbacks. First, IE is inherently probabilistic, but traditional query processing does not properly handle probabilistic data, resulting in reduced answer quality. Second, performance inefficiencies arise due to the separation of IE from query processing. Our work [3,4] has addressed these two problems by building on an in-database implementation of a leading IE model, Conditional Random Fields (CRFs), using the Viterbi inference algorithm. We have developed two different query approaches on top of this implementation. The first uses deterministic queries over maximum-likelihood extractions, with optimizations to push the relational operators into the Viterbi algorithm. The second extends the Viterbi algorithm to produce a set of possible extraction “worlds”, from which we compute top-k probabilistic query answers. We have explored the trade-offs of efficiency and effectiveness between the two approaches using real-life datasets. More recently, we have also explored the in-database implementations of a wide variety of inference algorithms suited to IE, including two Markov-Chain Monte Carlo (MCMC) algorithms, as well as the Viterbi and sum-product algorithms [1]. We have given rules for choosing appropriate inference algorithms based on the model, the query and the text, considering the trade-off between accuracy and runtime. Based on these rules, we have proposed a hybrid approach that optimizes the execution of a single probabilistic IE query by employing different inference algorithms appropriate for different records. Our techniques can achieve up to 10-fold speedups compared to the non-hybrid solutions proposed in the literature.
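
For readers unfamiliar with Viterbi inference, the following is a minimal in-memory Python sketch of the maximum-likelihood labeling step on a linear-chain model. It is not the in-database implementation of [3,4]: the score tables `emission` and `transition` are hypothetical log-space inputs, and a real CRF would derive them from learned feature weights.

```python
def viterbi(emission, transition):
    """Return the highest-scoring label sequence and its score.

    emission[t][y]:   score of label y at token position t (assumed input).
    transition[a][b]: score of moving from label a to label b (assumed input).
    Scores are additive, i.e. in log space.
    """
    num_labels = len(transition)
    score = list(emission[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emission)):
        new_score, pointers = [], []
        for y in range(num_labels):
            # Best previous label for ending at label y in position t.
            prev = max(range(num_labels),
                       key=lambda a: score[a] + transition[a][y])
            pointers.append(prev)
            new_score.append(score[prev] + transition[prev][y] + emission[t][y])
        score = new_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label.
    last = max(range(num_labels), key=lambda y: score[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Three tokens, two labels (0 and 1); values are illustrative only.
toy_emission = [[2, 0], [0, 1], [0, 3]]
toy_transition = [[1, 0], [0, 1]]
toy_path, toy_score = viterbi(toy_emission, toy_transition)
```

A deterministic query would run over the single labeling returned here, whereas the top-k approach described above needs several high-scoring labelings, which this sketch does not produce.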

Entity Resolution (ER) is the problem of determining if two entities in a data set refer to the same real-world object. ER is a complex and ubiquitous problem that appears in numerous application domains, and there have been several recent advancements by the Machine Learning community on the problem. However, the limited scalability of these techniques has prevented them from being applied in practical settings on large real-life datasets. Our work [2] has proposed a principled framework to scale any generic ER algorithm. Our technique consists of running multiple instances of the ER algorithm on small neighborhoods of the data and passing messages across neighborhoods to construct a global solution. We have shown formal properties of our framework and experimentally demonstrated its effectiveness in scaling ER algorithms. Other work [9] considers the ER problem with respect to complex queries over uncertain databases with unmerged duplicates. We have introduced the Entity-Join operator that allows expressing aggregation and iceberg queries over joins between tables with unmerged duplicates (captured through probabilistic linkages) and other database tables. We have also designed an indexing structure for efficient access to the resolution-related information, and techniques for efficient evaluation of complex queries. The results of our extensive experimental evaluation on three real-world datasets, and the comparison with the results of a sampling-based methodology, verify the effectiveness of our techniques. Finally, our more recent work [14] proposes a deeper integration of the complete information extraction pipeline (including ER and entity canonicalization) with probabilistic query processing, employing sophisticated factor-graph probabilistic models to capture uncertainty.
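
The neighborhood-and-messages idea can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the framework of [2]: the black-box ER algorithm is represented by a hypothetical `matcher(neighborhood, evidence)` callable, messages are simply the set of match pairs found so far, and the matcher is assumed to be monotone (more evidence never retracts a match), so the loop reaches a fixpoint.

```python
from itertools import combinations


def collective_er(neighborhoods, matcher):
    """Run a black-box matcher per neighborhood until no new matches appear.

    neighborhoods: iterable of sets of record ids.
    matcher(hood, evidence) -> set of matched (id, id) pairs; `evidence`
    holds the matches found so far in any neighborhood (the "messages").
    """
    evidence = set()
    changed = True
    while changed:
        changed = False
        for hood in neighborhoods:
            new_matches = matcher(hood, frozenset(evidence)) - evidence
            if new_matches:
                evidence |= new_matches
                changed = True
    return evidence


# Toy bibliographic records; names and links are illustrative only.
RECORDS = {
    1: {"name": "John Smith", "coauthor": 3},
    2: {"name": "J. Smith", "coauthor": 4},
    3: {"name": "Mary Jones", "coauthor": 1},
    4: {"name": "Mary Jones", "coauthor": 2},
}


def toy_matcher(hood, evidence):
    """Hypothetical rule: same name, or same surname plus matched coauthors."""
    found = set()
    for a, b in combinations(sorted(hood), 2):
        rec_a, rec_b = RECORDS[a], RECORDS[b]
        same_name = rec_a["name"] == rec_b["name"]
        same_surname = rec_a["name"].split()[-1] == rec_b["name"].split()[-1]
        coauthors = tuple(sorted((rec_a["coauthor"], rec_b["coauthor"])))
        if same_name or (same_surname and coauthors in evidence):
            found.add((a, b))
    return found


toy_matches = collective_er([{1, 2}, {3, 4}], toy_matcher)
```

In this example records 1 and 2 cannot be matched in the first pass; they are matched only after the second neighborhood reports that their coauthors (records 3 and 4) are the same entity, which is the kind of cross-neighborhood information the messages are meant to carry.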

Scalable Statistical Inference and Lineage Processing.
Other recent and ongoing research explores the use of the Hadoop platform as a tool for large-scale exact statistical inference over massive graphical models [10,13], as well as techniques for managing the lineage of probabilistic data in complex workflows comprising inherently uncertain Machine Learning operators [11].
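
As background on the kind of computation involved, the following single-machine Python sketch runs exact sum-product belief propagation on a chain-structured model. It says nothing about how [10,13] distribute the work on Hadoop, which the summary above does not describe; the potential tables `unary` and `pairwise` are hypothetical inputs, and a single shared edge potential is assumed for brevity.

```python
def chain_marginals(unary, pairwise):
    """Exact per-variable marginals on a chain via sum-product messages.

    unary[t][y]:    non-negative potential of value y at variable t.
    pairwise[a][b]: potential of neighboring values (left=a, right=b),
                    shared by every edge (a simplifying assumption).
    """
    n, k = len(unary), len(pairwise)
    fwd = [[1.0] * k for _ in range(n)]  # message reaching t from the left
    bwd = [[1.0] * k for _ in range(n)]  # message reaching t from the right

    for t in range(1, n):
        fwd[t] = [
            sum(fwd[t - 1][a] * unary[t - 1][a] * pairwise[a][b]
                for a in range(k))
            for b in range(k)
        ]
    for t in range(n - 2, -1, -1):
        bwd[t] = [
            sum(bwd[t + 1][b] * unary[t + 1][b] * pairwise[a][b]
                for b in range(k))
            for a in range(k)
        ]

    marginals = []
    for t in range(n):
        belief = [fwd[t][y] * unary[t][y] * bwd[t][y] for y in range(k)]
        total = sum(belief)
        marginals.append([value / total for value in belief])
    return marginals


# Three binary variables; neighbors prefer to agree. Values are illustrative.
toy_unary = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
toy_pairwise = [[2.0, 1.0], [1.0, 2.0]]
toy_marginals = chain_marginals(toy_unary, toy_pairwise)
```

Each message depends only on one neighbor's message and local potentials, which is what makes this style of computation a natural candidate for partitioned, large-scale execution.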

Our research targets the important area of scalable techniques for managing uncertain data. With the advent of the Internet, massive data sets are everywhere nowadays, and their volumes are constantly growing at unprecedented rates. Interestingly, only a small portion of online data is “clean” and well-structured in a form that can be manipulated in a relational database system; instead, the vast majority of that data is produced from a variety of “noisy”, potentially unreliable sources (e.g. automated data feeds, blog posts), lies in either semi-structured or unstructured data repositories (e.g. websites, financial reports), and is typically riddled with noise and uncertainty. Scalable techniques for uncertain/probabilistic data have become a key issue for some of the major Web companies that are now trying to make sense of Web data (e.g. through IE tools) to enable the next generation of Web search engines and applications. Clearly, our research investigates a problem of very strong interest from both the academic community and the commercial arena.

REFERENCES
[1] Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein. “Hybrid In-Database Inference for Declarative Information Extraction”, Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD’2011), Athens, Greece, June 2011.
[2] Vibhor Rastogi, Nilesh Dalvi, and Minos Garofalakis. “Large-Scale Collective Entity Matching”, Proceedings of the 37th International Conference on Very Large Databases (VLDB’2011, PVLDB Vol. 4, No. 4), Seattle, Washington, August 2011.
[3] Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein. “Querying Probabilistic Information Extraction”, Proceedings of the 36th International Conference on Very Large Databases (VLDB’2010, PVLDB Vol. 3, No. 1), Singapore, September 2010.
[4] Daisy Zhe Wang, Eirinaios Michelakis, Michael J. Franklin, Minos Garofalakis, and Joseph M. Hellerstein. “Probabilistic Declarative Information Extraction”, Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE’2010), Long Beach, California, USA, March 2010.
[5] Graham Cormode and Minos Garofalakis. “Wavelets and Histograms on Probabilistic Data”, IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 8, August 2010 (Invited Paper), pp. 1142-1157.
[6] Graham Cormode, Antonios Deligiannakis, Minos Garofalakis, and Andrew McGregor. “Probabilistic Histograms for Probabilistic Data”, Proceedings of the 35th International Conference on Very Large Databases (VLDB’2009), Lyon, France, August 2009.
[7] Graham Cormode and Minos Garofalakis. “Histograms and Wavelets on Probabilistic Data”, Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE’2009), Shanghai, China, March 2009. [** ICDE’2009 Best Paper Award **]
[8] Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M. Hellerstein. “BayesStore: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models”, Proceedings of the 34th International Conference on Very Large Databases (VLDB’2008), Auckland, New Zealand, August 2008.
[9] Ekaterini Ioannou and Minos Garofalakis. “Query Analytics in Uncertain Databases with Unmerged Duplicates”, Manuscript under submission to IEEE Transactions on Knowledge and Data Engineering (TKDE).
[10] Evangelos Vazaios and Minos Garofalakis. “BePadoop: Belief Propagation in Hadoop”, Working paper (in preparation).
[11] Lampros Papageorgiou, Evangelos Vazaios, and Minos Garofalakis. “Lineage Processing in Uncertain Operator Workflows”, Working paper (in preparation).
[12] Graham Cormode, Antonios Deligiannakis, Minos Garofalakis, and Andrew McGregor. “Probabilistic Histogram Synopses for Probabilistic Data”, Working paper (in preparation for submission to ACM Transactions on Database Systems (TODS)).
[13] Evangelos Vazaios. “BePadoop: Large-Scale Exact Belief Propagation in Hadoop”, M.Sc. Thesis, Technical University of Crete, 2013.
[14] Ekaterini Ioannou and Minos Garofalakis. “Probabilistic Query Processing over Information Extraction Pipelines”, Working paper (in preparation).