Periodic Reporting for period 2 - BigFastData (Charting a New Horizon of Big and Fast Data Analysis through Integrated Algorithm Design)
Reporting period: 2019-03-01 to 2020-08-31
This project considers three pillars:
1. Parallelism: There is a fundamental tension between data parallelism (for scale) and pipeline parallelism (for low latency) when data exceeds the memory size. We propose new approaches to integrate different forms of parallelism to maximize performance on large datasets and high-volume data streams.
2. Analytics: The literature lacks a large body of algorithms for complex order-related analytics on large datasets. We propose new algorithmic solutions to enable critical temporal and sequence analytics under different forms of parallelism.
3. Optimization: When running analytics jobs, today’s big data systems are best-effort in nature. To respond to the diverse performance goals and budgetary constraints of users, we develop a principled optimization framework that suits the new characteristics of cloud data analytics and explores the best way to meet user objectives.
Integrated algorithm design across these three efforts will lay a solid foundation for big and fast data analysis, enabling a new integrated parallel processing paradigm, algorithms for critical order-related analytics, and a principled optimizer with strong performance guarantees. It will also broadly enable (i) accelerated information discovery in emerging domains such as genomics, (ii) social benefits of early, well-informed decisions, and (iii) economic benefits of reduced user payment for cloud computing and “greener cloud computing” where resources are utilized fully and efficiently to fulfill global data analytical needs.
A Principled Optimization Framework. A main focus of our work is to develop a principled optimization framework for supporting data analytics in the cloud based on user-specified objectives. Our work presents a data analytics optimizer that automatically determines a cluster configuration, including a suitable number of cores and other system parameters, that best meets user objectives. At the core of our work is a principled multi-objective optimization (MOO) approach that computes a Pareto optimal set of job configurations to reveal tradeoffs between different user objectives, recommends a new job configuration that best explores such tradeoffs, and employs novel optimizations to enable such recommendations within a few seconds. In addition, the optimizer requires a predictive model for each user-specified objective. Our work explores the latest deep learning techniques to automatically learn a model for each objective of a given task from the runtime behavior of the task.
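To make the notion of a Pareto optimal set concrete, the following is a minimal sketch in Python, not the project's optimizer. It assumes two objectives to minimize (latency and cost), a small hand-written list of candidate configurations with made-up values, and a simple weighted-sum rule for picking one recommendation. The names `pareto_front` and `recommend`, the tuple layout, and the weighting rule are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical configuration record: (cores, latency_seconds, cost_dollars).
# Both latency and cost are objectives to minimize.
Config = Tuple[int, float, float]


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates.

    A configuration is dominated if another one is no worse on both
    objectives and strictly better on at least one.
    """
    front = []
    for c in configs:
        dominated = any(
            o[1] <= c[1] and o[2] <= c[2] and (o[1] < c[1] or o[2] < c[2])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


def recommend(front: List[Config], w_latency: float = 0.5) -> Config:
    """Pick one Pareto optimal configuration by a weighted sum.

    Assumption: objectives are min-max normalized over the front, and the
    user preference is a single weight in [0, 1] on latency.
    """
    lat_min = min(c[1] for c in front)
    cost_min = min(c[2] for c in front)
    lat_span = (max(c[1] for c in front) - lat_min) or 1.0
    cost_span = (max(c[2] for c in front) - cost_min) or 1.0

    def score(c: Config) -> float:
        lat = (c[1] - lat_min) / lat_span
        cost = (c[2] - cost_min) / cost_span
        return w_latency * lat + (1.0 - w_latency) * cost

    return min(front, key=score)


# Made-up candidate configurations, for illustration only.
candidates = [
    (4, 120.0, 1.0),
    (8, 70.0, 1.6),
    (16, 45.0, 2.8),
    (16, 80.0, 3.0),   # dominated by (16, 45.0, 2.8)
    (32, 44.0, 5.5),
    (8, 130.0, 1.7),   # dominated by (4, 120.0, 1.0)
]

front = pareto_front(candidates)
balanced = recommend(front, w_latency=0.5)    # (16, 45.0, 2.8)
cost_first = recommend(front, w_latency=0.1)  # (4, 120.0, 1.0)
```

In this sketch the front exposes the tradeoff (more cores reduce latency but raise cost), and changing the weight moves the recommendation along the front. The quadratic dominance check is adequate for a handful of candidates only; it says nothing about how the project achieves recommendations within a few seconds.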
Sequence Analytics. We further design efficient algorithms for complex sequence analytics, such as those used in genome data analysis, which are known to be hard to parallelize. For such sequence analytics, naive data partitioning methods cannot guarantee that the analysis on each partition has all the required data. To ensure correctness of parallel analysis, our work develops a Genome Data Parallel Toolkit (GDPT) that provides data partitioning and shuffling strategies that suit the data access patterns of genome analysis programs. In particular, we consider two complex variant detection algorithms that perform window-based micro-assembly to detect haplotypes. We developed a novel approach that employs overlapping data partitions and provable "safety" conditions for using such partitions, so that the output of the parallel program is consistent with the output of the original serial program, hence ensuring the quality of the parallel computing result.
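The idea of overlapping partitions with a safety check can be sketched as follows. This is an illustrative Python example under stated assumptions, not the GDPT implementation: positions are plain integers on a single sequence, each partition has a "core" interval padded by a fixed overlap on both sides, and the hypothetical safety rule is that a window belongs to the partition whose core contains its start and is safe only if the padded interval covers the whole window. The report does not specify the project's actual safety conditions; the function names and the rule here are invented for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open interval [start, end)


def overlapping_partitions(length: int, chunk: int, overlap: int) -> List[Dict[str, Interval]]:
    """Split [0, length) into core intervals of size `chunk`,
    each padded by `overlap` positions on both sides (clipped to bounds)."""
    parts = []
    for core_start in range(0, length, chunk):
        core_end = min(core_start + chunk, length)
        parts.append({
            "core": (core_start, core_end),
            "padded": (max(0, core_start - overlap), min(length, core_end + overlap)),
        })
    return parts


def window_is_safe(window: Interval, part: Dict[str, Interval]) -> bool:
    """Hypothetical safety rule: the partition owns the window (the window
    starts inside the core) and its padded interval holds all window data."""
    w_start, w_end = window
    core_start, core_end = part["core"]
    pad_start, pad_end = part["padded"]
    owns = core_start <= w_start < core_end
    covered = pad_start <= w_start and w_end <= pad_end
    return owns and covered


def safe_owners(window: Interval, parts: List[Dict[str, Interval]]) -> List[int]:
    """Indices of partitions that can process the window safely."""
    return [i for i, p in enumerate(parts) if window_is_safe(window, p)]


parts = overlapping_partitions(length=1000, chunk=400, overlap=50)

# A window straddling the boundary at 400 but within the 50-position overlap
# is processed by exactly one partition.
near_boundary = safe_owners((380, 430), parts)   # [0]

# A window reaching past the overlap has no safe owner in this sketch and
# would need a fallback, such as a larger overlap or serial processing.
too_wide = safe_owners((380, 470), parts)        # []
```

Because ownership is decided by the core interval, no window is processed twice, and the coverage check flags windows whose result could otherwise differ from the serial run.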
A Principled Optimization Framework. A successful design of a data analytics optimizer should meet two requirements: (1) the optimizer can indeed improve system performance to better meet the user objectives; (2) the optimizer can make a recommendation on demand for each user job within a few seconds, hence not delaying the start of a scheduled job. Evaluation results using large benchmarks show that our multi-objective optimization (MOO) techniques provide a 2-50x speedup over existing MOO methods in making a recommendation, and outperform OtterTune, a state-of-the-art performance tuning system, reducing the running time of the TPCx-BB benchmark by 26%-49% while adapting to different user preferences on multiple objectives.
Sequence Analytics. For genomic data analysis, we tackled two complex window-based variant detection algorithms that did not have a parallel solution that could guarantee the quality of the parallel output. Evaluation results show that our parallel algorithm outperforms Spark HC (HaplotypeCaller), a state-of-the-art parallel algorithm for HaplotypeCaller, by 2.7x in running time while providing 4 orders of magnitude improvement in the accuracy of the parallel output: out of 4,931,429 detected variants, our output differs from the serial output in only 8 results, while Spark HC produces parallel output with 84,869 differences from the serial output. Given that variant detection results are often used in further analysis to assist in diagnosis and treatment, our parallel algorithms have significantly advanced the state of the art in both processing speed and quality of answers, where the latter is crucial for realizing the vision of “precision medicine”.
In the future, we will advance our project by addressing the following tasks:
Optimization: We will improve our work for automatically learning a model for each user objective and demonstrate the accuracy of our learned models against existing modeling techniques. In addition, we will combine these learned models with our multi-objective optimizer to demonstrate the net benefits of such a system for both the analytics user and the cloud service provider.
Analytics: We will study temporal analytics, with a focus on explainable anomaly detection on high-volume data streams. We aim to design new techniques to address challenges such as detecting multiple types of anomalies, discovering human-readable explanations for detected anomalies, and achieving both objectives on high-volume data streams.
Systems: We aim to develop advanced systems support for complex analytics, such as our explainable anomaly detection pipeline on data streams. We will explore techniques such as operator fusion and intelligent data reuse, thereby pipelining execution across the different models developed in the pipeline, while being able to engage multiple machines to exploit data parallelism.