
Data Engineering for Data Science

Periodic Reporting for period 1 - DEDS (Data Engineering for Data Science)

Reporting period: 2021-03-01 to 2023-02-28

Data is now a critical corporate asset. A typical data value creation chain encompasses multiple disciplines and people with different roles, of which Data Science and Data Engineering are two prominent examples. In setting up the pipelines and functionalities that comprise this ecosystem, data engineers often face challenges posed by the extreme characteristics of Big Data. Addressing these challenges demands innovative technological solutions for data management, involving new architectures, policies, practices, and procedures that properly manage the full data lifecycle needs of an organization. The Data Engineering for Data Science (DEDS) program aims to carry out foundational research within the Big Data value chain and develop new technologies that improve the efficiency of data management, specifically for Data Science. DEDS is jointly organized by Université Libre de Bruxelles (Belgium), Universitat Politècnica de Catalunya (Spain), Aalborg Universitet (Denmark), and the Athena Research and Innovation Institute (Greece). Partner organizations from research, industry, and the public sector contribute prominently to the program by training students and providing secondments in a wide range of domains, including Energy, Finance, Health, Transport, and Customer Relationship and Support. DEDS is a 3-year doctoral program based on a co-tutelle model. A complementary set of joint doctoral projects focuses on the main aspects of holistic management of the full data lifecycle.
The DEDS program has reached its halfway point (M24) with excellent progress towards its goals. The researchers involved in the program are advancing in their studies, with all needed support already established. An evaluation has already been conducted, and the majority of the researchers (12 out of 15) have passed. We group the identified contributions needed to unlock value from raw data into the following four functional modules:
- The Governance module:
-- Semantic-aware heterogeneity management: The project will utilize state-of-the-art tools and techniques from the semantic web, knowledge graphs, natural language processing, and machine learning to build a flexible and scalable framework that enables more accurate inferencing and sophisticated data-driven decision-making (a minimal knowledge-graph sketch follows this list).
-- Privacy-aware data integration: The main deliverable of this project will be a framework and library for data processing, evaluation, and synthesis. This framework will be machine-learning-library agnostic, so as to allow end-users to develop, test, and implement new data synthesis algorithms with the tools they prefer.
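
To make the semantic-aware integration idea concrete, below is a minimal sketch (not DEDS code) that uses the open-source rdflib Python library to record, in a small knowledge graph, that two heterogeneously named source fields denote the same concept, and then retrieves the alignment with SPARQL. The ex: namespace, the sameConceptAs property, and the field names are all hypothetical.

from rdflib import Graph, Namespace, RDF

# Hypothetical shared vocabulary covering two heterogeneous sources.
EX = Namespace("http://example.org/deds/")

g = Graph()
# Source A exposes "customer_id" and source B exposes "client_no";
# both are declared to denote the same underlying concept.
g.add((EX.customer_id, RDF.type, EX.Identifier))
g.add((EX.client_no, RDF.type, EX.Identifier))
g.add((EX.customer_id, EX.sameConceptAs, EX.client_no))

# A downstream integrator discovers the alignment with a SPARQL query.
query = "SELECT ?a ?b WHERE { ?a ex:sameConceptAs ?b }"
for a, b in g.query(query, initNs={"ex": EX}):
    print(f"{a} aligns with {b}")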

- The Storage and Processing module:
-- Distribution and replication for feature selection: this project addresses multi-objective feature selection, with particular attention to fairness
-- Transparent in-situ data processing: data processing engines need performance improvements to exploit modern, heterogeneous hardware
-- Model-based storage for time series: the current implementation of the time-series database ModelarDB requires performance optimizations (a sketch of the underlying model-based compression idea follows this list)
-- Analytic operators for trajectories: this project aims to transform large trajectory data into actionable insights for various stakeholders
-- End-to-end optimization for data science in the wild: developing the next generation of optimizers for the data science pipeline to significantly reduce the cost of processing data
-- Physical optimization for large-scale DS workloads: a state-of-the-art survey on physical optimization for large-scale data science workloads clarified the use cases, learning approaches, technical limitations, design trade-offs, and new challenges that arise when learning becomes a core component of the optimization pipeline.
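
To make the model-based storage idea concrete, here is a minimal sketch of greedy piecewise-constant compression, the family of techniques that model-based time-series systems such as ModelarDB build on. It illustrates the general idea under an absolute error bound; it is not ModelarDB's actual implementation.

def pmc_compress(values, error_bound):
    """Emit (start, end, value) segments where every point in
    values[start:end] lies within +/- error_bound of the stored value."""
    segments = []
    start = 0
    lo = hi = values[0]
    for i in range(1, len(values)):
        new_lo = min(lo, values[i])
        new_hi = max(hi, values[i])
        if new_hi - new_lo > 2 * error_bound:
            # The running segment cannot absorb values[i]; close it.
            segments.append((start, i, (lo + hi) / 2))
            start, lo, hi = i, values[i], values[i]
        else:
            lo, hi = new_lo, new_hi
    segments.append((start, len(values), (lo + hi) / 2))
    return segments

# Seven readings collapse into two segments, each stored as one value.
readings = [20.0, 21.0, 19.0, 20.0, 25.0, 26.0, 25.0]
print(pmc_compress(readings, error_bound=1.0))
# [(0, 4, 20.0), (4, 7, 25.5)]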

- The Preparation module:
-- Spatio-temporal data integration and analysis: the state-of-the-art survey highlights the limited amount of quantified information on greenhouse gas emissions; we aim to contribute to more sustainable transportation (a toy emission-estimation sketch follows this list)
-- Unified information extraction for data preparation: we anticipate facilitating the information extraction process and addressing the challenges that hinder its broad deployment
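
As a toy illustration of turning spatio-temporal data into quantified emission information, the sketch below integrates a GPS trajectory into a travelled distance and applies a per-kilometre CO2 factor. The coordinates and the 170 g/km factor are illustrative assumptions, not project data.

import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def trip_emissions_kg(points, grams_co2_per_km=170.0):
    """Sum leg distances along a (lat, lon) trajectory and multiply by an
    assumed per-kilometre emission factor (170 g/km is purely illustrative)."""
    dist = sum(haversine_km(*a, *b) for a, b in zip(points, points[1:]))
    return dist * grams_co2_per_km / 1000.0

trip = [(55.676, 12.568), (55.680, 12.590), (55.700, 12.600)]
print(f"Estimated emissions: {trip_emissions_kg(trip):.3f} kg CO2")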

- The Analysis module:
-- Recommendations and data exploration: modeling data as graphs and combining graph representations with recommendation techniques to support complex data exploration
-- Adversarial learning fraud detection: the work focuses on studying fraud detection under an adversarial learning framework, addressing the exploitation of machine learning algorithms by fraudsters.
-- Prescriptive analytics: the project's main objective is to develop approximate query processing techniques for predictive and prescriptive analytics (a minimal sampling-based sketch follows this list)
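
The sketch below illustrates the approximate query processing idea behind the prescriptive analytics project: answer an aggregate query from a uniform sample and scale the partial result by the inverse sampling rate, which is unbiased for SUM and COUNT. The synthetic rows and the 1% rate are hypothetical.

import random

def approx_sum(rows, value_of, sample_rate=0.01, seed=7):
    """Scan roughly sample_rate of the rows at random and scale the
    partial aggregate up by 1 / sample_rate."""
    rng = random.Random(seed)
    partial = sum(value_of(r) for r in rows if rng.random() < sample_rate)
    return partial / sample_rate

# Hypothetical workload: one million synthetic sensor rows.
rows = [{"reading": (i % 100) / 10.0} for i in range(1_000_000)]
exact = sum(r["reading"] for r in rows)
approx = approx_sum(rows, lambda r: r["reading"])
print(f"exact={exact:.0f} approx={approx:.0f}")
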
In terms of expected results and impact:
- The research project on semantic-aware heterogeneity management aims to develop a flexible and scalable data integration framework, enabling precise inferencing and data-driven decision-making, with applications in multiple domains.
- The research on distribution and replication for feature selection addresses scalability challenges in machine learning, particularly in domains such as telecommunications, weather forecasting, and medicine that generate high volumes of data.
- The research on transparent in-situ data processing targets improving processing performance in real-world data processing engines through optimized heterogeneous CPU-GPU processing approaches.
- The project on model-based storage for time series contributes to enhancing open-source database systems, benefiting the integration of IoT into renewable energy production and enabling higher-frequency data processing.
- The research on analytic operators for trajectories emphasizes the importance of advanced storage and processing techniques in databases, specifically in the context of analyzing trajectory data using MobilityDB.
- The project on end-to-end optimization for data science aims to optimize data science workflows, automate tasks, reduce costs, and facilitate collaboration within data science teams.
- The research on physical optimization for large-scale data science workloads focuses on learned query optimizers that can improve business productivity, user experience, data analytics, and resource utilization while reducing costs and increasing automation.
- The research on feature extraction suggests that explainability can enhance model performance, particularly in weak supervision scenarios, and aims to make information extraction deployment affordable for everyone.
- The research on emission analysis aims to contribute to a sustainable society by providing quantified information on greenhouse gas emissions and facilitating information extraction processes.
- The research on complex data management workflows aims to develop efficient and scalable methods for improved performance in predictive and prescriptive analytics. The goal is to enable transparent, scalable, and error-free analysis of high-volume IoT data streams.
- The researchers are currently working on integrating unstructured data analysis into complex analytics workflows. The goal is to further extend the framework to support arbitrary group search queries, enabling applications based on prescriptive analytics in big data analysis.
Figure: DEDS summary (deds-summary.png)