
High-performance data-centric stack for big data applications and operations

Periodic Reporting for period 2 - BigDataStack (High-performance data-centric stack for big data applications and operations)

Reporting period: 2019-07-01 to 2020-12-31

Big data frameworks exploit underlying infrastructure management systems that have not been implemented in a “big data context”. BigDataStack introduces a frontrunner data-driven system ensuring that resource management is efficient and optimized for data operations. The system bases all decisions on data aspects affecting the interdependencies between storage, compute and network resources. Data as a Service promotes automation and ensures that the provided data are meaningful, of value and fit-for-purpose through approaches for cleaning and adaptable storage. Unique seamless data analytics is realized across multiple data stores, along with advanced modelling techniques defining flexible schemas that can be exploited across processing frameworks. The dimensioning workbench facilitates data-focused application analysis and prediction of the required resources. The data toolkit provides an open and extensible environment in which data scientists specify the preferences and constraints that are then used for resource management.
• Overall architecture including the analysis of specific cross-layer topics and information flows based on analysed application, user and technical requirements.
• Container platform innovation with the integration of OpenShift on top of OpenStack, enriched with Kuryr-Kubernetes as the main OpenShift CNI, yielding a remarkable infrastructure speed-up.
• Innovative realization engine, facilitating the registration, configuration, deployment and management of complex containerized applications in Kubernetes-based environments. It provides a standardized mechanism for grouping components into larger conceptual applications and then defining complex multi-stage alteration actions upon them.
• Runtime adaptation sense-think-act system, utilizing resource and data services tracking for intelligent decision making.
• Seamless data-as-a-service enabling data provisioning and access across different data stores, and real-time analysis through full SQL compatibility across stores.
• General availability in 4 IBM commercial services of a leading data skipping technology, which also addresses SQL JOINs.
• Process modelling framework with AutoML through process mapping to facilitate the needs of business analysts for high-level workflows.
• Overall framework for dimensioning, providing resource estimates for application and data services considering the workloads and the data operations for each service in a fully automated way.
• Automatic estimation of application needs by recommending resource configurations based on application and resource features in addition to QoS objectives set by the application owner, reducing the need for 3rd-party benchmarking services to perform this role.
• New BetaRecSys framework for item recommendations. It was initially developed for a project use case, but it is now open source and is being further developed in on-going projects (e.g. H2020 INFINITECH).
• Outreach activities including scientific publications, organization of events, contributions to BDVA Innovation Marketplace, interactions with EOSC-DIH to sign a collaboration agreement and definition of an adoption roadmap.
• Exploitation plans identifying 21 assets, 3 use cases, 8 MVPs and groups of results (for joint exploitation), together with actual exploitation of components (e.g. data skipping, infrastructure management), commercialization paths for several outcomes, the filing of 3 patents and the conclusion of a technology transfer action.
• Open source contributions: Spark, Kubernetes, OpenShift Installer, Cluster Network Operator, Cluster API Provider OpenStack, Kuryr-Kubernetes, Kuryr-tempest-plugin, Octavia, Neutron, Gophercloud, Terraform.
• Enhancement of Kuryr for OVN-based distributed load balancing, one of the most remarkable features added to Kuryr, achieving ~3x throughput improvement and ~90% latency reduction. It also speeds up service creation by one order of magnitude (from minutes to seconds) and reduces the resource utilization footprint (one of the barriers to larger Kuryr adoption).
• Optimized deployment of big data applications through a recommendation engine for deployment configurations achieving 0.6896 NDCG (significantly more effective than other deployment recommendation baselines).
• Dynamic orchestration combining heuristics with deep reinforcement learning, showing a 35% improvement in SLO satisfaction compared to regular reinforcement learning.
• Triple monitoring for resources, applications and data services through an innovative federated model, and optimized metrics storage based on their usage, with performance optimization of about 90%.
• Fast and efficient log search capabilities over containers, with indexing and retrieval response times below one tenth of a second (100 ms) when deployed on an OpenShift cluster with over 40 users and 32 million lines of log data (450 million tokens). 100 ms is the SotA gold standard for search engine response times.
• Domain-agnostic error detection that can be applied to a dataset without domain-specific knowledge and supports both numerical and categorical attributes.
• Adaptable distributed storage splitting a logical query into multiple queries, splitting and migrating regions when they become too big to be efficiently managed. It scales under operational workload with no impact on on-going transactions, where NoSQL approaches fail, and with no downtime or impact on the performance, where traditional RDBMS solutions fail.
• Seamless analytics on top of heterogeneous stores through a standard JDBC driver to submit analytical queries, without requiring a new query language for data spanning different stores. The framework also moves historic data from a DB to an object store while ensuring the ACID properties. It can be used with any target store that offers a standard JDBC interface, or integrated with any JDBC-compatible processing framework, which in practice covers every data management technology. It can execute all SQL operations efficiently in a distributed fashion, even aggregations and join operations.
• Enhanced data skipping technology enabling Apache Spark to reduce data ingestion of Spark SQL based jobs. Our framework is the first to natively support arbitrary data types (e.g. geospatial, timeseries or genomic data) and data skipping for queries with user defined functions.
• Complex event processing (CEP) engine parallelizing and distributing queries across servers and small devices. It is self-adaptive and re-deploys queries or parts of queries (subqueries) without stopping query processing, where SotA systems fail. The CEP can be deployed in a data center and in a geo-distributed scenario.
• Application dimensioning considering data workloads to estimate resource needs. It enables service stress testing on a variety of execution platforms in a plug-in manner and sequential or parallel execution of the benchmarking.
• Process modelling and mapping facilitating the specification of high-level processes in analytics workflows and the automated mapping to analytics algorithms based on a meta-learning approach.
• Data toolkit enabling data scientists to constantly check the necessary interdependency rules of an analytics graph, and to set their requirements and preferences concerning both end-to-end graph objectives and requirements linked with specific analytic tasks.
• Deep-learning-based product recommendation models to support a retail use case. They improve performance over the SotA models by 5% NDCG on average.
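
Several of the results above (the deployment recommendation engine and the product recommendation models) are reported in NDCG. As a reading aid, the following is a minimal sketch of how normalized discounted cumulative gain is commonly computed for one ranked list. The report does not state which NDCG variant or cutoff was used, so the linear-gain formulation, the function names `dcg` and `ndcg`, and the optional cutoff `k` are illustrative assumptions rather than the project's evaluation code.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of relevance scores.

    Assumption: linear gain (rel / log2(rank + 1) with 1-based ranks).
    """
    rels = relevances[:k] if k is not None else relevances
    # enumerate() is 0-based, so position i has 1-based rank i + 1
    # and its discount is log2((i + 1) + 1) = log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    The ideal ranking is the same relevance scores sorted in
    descending order. Returns 0.0 when no item is relevant.
    """
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg(relevances, k) / ideal_dcg


# Hypothetical example: relevance of the items in the order a
# recommender ranked them (higher means more relevant).
ranked_relevances = [3, 0, 2, 1]
score = ndcg(ranked_relevances, k=4)
```

A score of 1.0 means the recommender returned items in the best possible order. Scores are usually averaged over all queries or users, which is presumably how a single figure such as 0.6896 is obtained.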