Charting a New Horizon of Big and Fast Data Analysis through Integrated Algorithm Design

Informations projet

BigFastData

N° de convention de subvention: 725561

DOI

10.3030/725561

Projet clôturé

Date de signature de la CE 15 Juin 2017

Date de début 1 Septembre 2017

Date de fin 31 Decembre 2023

Financé au titre de

EXCELLENT SCIENCE - European Research Council (ERC)

Coût total

€ 2 472 752,00

Contribution de l’UE

€ 2 472 752,00

2 472 752,00

Coordonné par

ECOLE POLYTECHNIQUE
France

Periodic Reporting for period 4 - BigFastData (Charting a New Horizon of Big and Fast Data Analysis through Integrated Algorithm Design)

Période du rapport: 2022-03-01 au 2023-12-31

This proposal developed an algorithmic foundation for big and fast data analytics that supports not only the scale of data analytics, but also perpetual, low-latency processing for a broad set of analytical tasks. Prior to this project, data analytics systems fell short due to the lack of support of complex analytics with scale, low latency, and strong guarantees of user performance requirements. More specifically, our project addressed the following topics:

Optimization of Cloud Analytics: Our first focus was a principled optimization system for supporting big and fast data analytics for numerous cloud users. As a big data system has hundreds of parameters that control the degree of parallelism, memory allocation, granularity of scheduling, etc., different settings of them lead to drastically different performance regarding user objectives like latency, throughput, and cloud cost. It is crucial to find the optimal configuration of this large set of parameters that best meets the user objectives. To this end, we designed the first multi-objective cloud analytics optimizer can automatically determine all the tunable parameters of an analytics job, collectively called a job configuration, based on user objectives. While multi-objective optimization (MOO) has been studied in the theory community, we addressed the distinct challenges in designing an optimizer for cloud analytics, i.e. highly diverse workloads and system behaviors, large numbers of parameters, and the requirement of low solving time to avoid delays in starting a job.

Stream Analytics: Our second focus was new algorithms for eXplainable Anomaly Detection (XAD) on data streams, which finds application in numerous use cases such as safeguarding financial systems and numerous IT systems. Yet, the literature lacked an algorithmic solution that adapts to complex, evolving patterns, provides human-readable, actionable insights, and does so at stream speed. Our project devised a new algorithmic framework to meet these challenges.

Exploratory Analytics: We further developed algorithms for a new form of big data analytics with human-in-the-loop: the human user is interactively exploring a large data repository and expects the system to provide support for efficiently and effectively achieving user data exploration goals. We proposed a large suit of algorithms to expedite data exploration on large databases and RDF graphs.

A Multi-Objective Cloud Optimizer. Our first focus was a principled multi-objective optimizer that automatically determines all the tunable parameters of an analytics job, collectively called a job configuration, based on user objectives.
1) Modeling of Objectives: Our first activity was developing predictive models for different user objectives, for which traditional modeling methods failed to adapt to highly diverse cloud workloads and system behaviors. For SQL queries, we introduced a new approach that embeds query plans into numerical vectors using a graph transformer network (GTN), and further combines such embeddings with other information such as the job configuration, machine states, and hardware to predict the end objectives.
2) Multi-Objective Optimization (MOO) Algorithms: We developed a new theoretical framework that transforms a MOO problem into a series of constrained optimization problems, each of which guarantees to return a Pareto optimal point, and further a suite of incremental, uncertainty-aware, and efficient Progressive Frontier algorithms. Our work further extended to large-scale serverless computing, whose resource optimization problem involves the assignment of many parallel instances to 1000’s machines and resource allocation to each parallel instance. To handle this large set of parameters, we developed new scalable MOO algorithms that reduce both latency and cost by determining the best assignment of instances to machines and instance-specific resource plans.

Stream Analytics. To support explainable anomaly detection on data streams, we devised a brand new algorithmic framework that leverages deep learning (DL) to detect diverse patterns, yet overcomes the non-interpretability of DL by returning human-readable, actionable insights, and custom dimensionality reduction to expedite anomaly detection while enabling explainability.

Exploratory Analytics. For database exploration, we proposed an “explore-by-example” approach cast in a principled active learning framework. We further explored the properties of important classes of database queries to enable the design of new algorithms, including a dual space model (DSM) and its factorized version over high-dimensional space, which significantly expedited the convergence of data exploration. For RDF exploration, we proposed a new one-pass algorithm for efficiently evaluating a lattice of aggregates and a novel early-stop technique for pruning uninteresting aggregates.

A Multi-Objective Cloud Optimizer. While multi-objective optimization (MOO) has been studied in the theory community, existing algorithms are too slow for cloud analytics systems due to their large parameter space and the stringent solving time constraint to avoid delays in starting each analytical job. Our proposed Progressive Frontier algorithms improve efficiency via incremental computation, fast approximation, and parallel processing. Evaluation using benchmarks shows that our algorithms provide a 2-50x speedup over existing MOO methods while offering good coverage of the Pareto front. For large scale serverless computing, we determine the assignment of parallel instances to machines, using fast approximate methods to handle the large assignment matrix, and further derive instance-level resource plans to reduce both latency and cost, using a scalable MOO algorithm across parallel instances. Evaluation using AliCloud production workloads shows that our optimizer could reduce 37-72% latency and 43-78% cost simultaneously, compared to the current scheduler, while running in 0.02-0.23s. The above results present a breakthrough in translating the MOO theory to an efficient, effective multi-objective optimizer for large-scale cloud analytics.

Stream Analytics. For explainable anomaly detection (XAD), our work was the first to integrate deep learning for detecting complex anomaly patterns, yet be able to return human-readable explanations that are concise and consistent, and further employ human-interpretable dimensionality reduction to expedite anomaly detection while retaining explainability. We further developed Exathlon, the first comprehensive public benchmark for this topic and provided the first experimental study evaluating state-of-the-art XAD techniques.

Exploratory Analytics. While our database exploration is cast in the active learning (AL) framework, existing AL methods suffer from slow convergence when running on large databases. Our new approach leverages the properties of user interest patterns and enables highly efficient exploration, outperforming existing AL methods with 0.88x-22.14x higher accuracy for benchmark queries on large databases, while maintaining interactive speed. Our RDF graph exploration, with one-pass algorithm for evaluating a lattice of aggregates and probabilistic pruning methods, offer up to 2.9xspeedup over existing algorithms, as well as scalability as the data size and complexity grow.

Three scientific goals of big fast data analysis

Periodic Reporting for period 4 - BigFastData (Charting a New Horizon of Big and Fast Data Analysis through Integrated Algorithm Design)

Partager cette page Partager cette page sur les réseaux sociaux

Télécharger Télécharger le contenu de la page