
Optimised Framework based on Rough Set Theory for Big Data Pre-processing in Certain and Imprecise Contexts

Periodic Reporting for period 1 - RoSTBiDFramework (Optimised Framework based on Rough Set Theory for Big Data Pre-processing in Certain and Imprecise Contexts)

Reporting period: 2017-03-01 to 2019-02-28

Big data is characterized by its Volume, Variety, Velocity, and Veracity. These 4Vs make it difficult to derive meaningful information from big data, so addressing them is crucial to obtaining valuable data. However, standard techniques for dealing with the 4Vs have several limitations: (1) They depend on expert knowledge, which may be subjective and can mislead decision making. (2) They cannot assess how trustworthy data is when values are missing. (3) They cannot cope with the intensive computations that big data requires. As a result, these techniques are ineffective and cannot address the 4Vs together.
This project’s overarching aim is to fill these research gaps by developing an optimized framework, dubbed RoSTBiDFramework, for big data pre-processing in certain and imprecise contexts. RoSTBiDFramework combines a mathematical model, namely Rough Set Theory (RST), for data pre-processing with hashing techniques for optimization and distributed frameworks. This hybridization of different disciplines is the key to addressing the 4Vs and supporting decision makers.

Applying RoSTBiDFramework to large pools of data gathered from any discipline or sector yields only valuable, fine-grained data from which patterns can be discerned and decision-making improved. The RoSTBiDFramework output, i.e. the pre-processed data, is expected to become a main source of competition and growth for a variety of industries and businesses, allowing them to enhance their productivity and to create value for the world economy by increasing the quality of products and services, minimizing risks, and unearthing valuable insights that would otherwise remain hidden. In this way, RoSTBiDFramework will give companies a more solid basis on which to outperform their peers.

RoSTBiDFramework is based on five interconnected Research Objectives (RO): (RO1) the design and implementation of a feature selection framework based on RST in a certain context; (RO2) the formalisation and implementation of an optimised version of the framework; (RO3) the derivation of a general formulation of RST to deal with missing attribute values; (RO4) the formalisation and implementation of an optimised version of the framework for handling the veracity aspect; (RO5) the demonstration of RoSTBiDFramework on real-world data.
The project covers three parts: (I) handling big data in a certain context (RO1 and RO2); (II) dealing with big data in an imprecise context (RO3 and RO4); (III) taking the research into practice (RO5).

In part I, the work first involved setting up a decentralised feature selection framework based on RST, whose architecture follows the structure of the Spark MapReduce model. To reduce the cost of the rough set computations, the proposed approach splits the given data set into partitions with smaller numbers of features, which are then processed in parallel. Results showed that the RST solution is efficient: it performs big data feature selection without any significant information loss while decreasing the run time of the learning process. Secondly, the mathematical calculations of the RST concepts were revised, as they were still expensive in terms of running time. An optimised feature selection approach was therefore designed and implemented, using a hashing technique for the most time-consuming operations. The hashing technique places similar features in the same bucket and maps the generated buckets to partitions, so that the universe is split in a more appropriate way. Results showed that the optimised approach preserves data dependency in the distributed environment and ensures a lower computational cost.
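The core rough set computation behind this kind of feature selection can be illustrated with a minimal, single-machine sketch. It is not the project's implementation: the function names, the toy data, and the greedy search strategy (in the style of QuickReduct) are assumptions made for illustration, and the distribution over partitions is omitted. A Python dictionary keyed by attribute-value tuples stands in for the hashing step that groups objects into equivalence classes.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes on `attrs`.
    The dict keyed by value tuples plays the role of a hash table."""
    buckets = defaultdict(list)
    for i, row in enumerate(rows):
        buckets[tuple(row[a] for a in attrs)].append(i)
    return list(buckets.values())

def dependency(rows, attrs, decision):
    """Dependency degree: share of rows lying in equivalence classes
    whose members all carry the same decision value (positive region)."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)

def greedy_reduct(rows, features, decision):
    """Hypothetical greedy search: keep adding the feature that raises
    the dependency degree most until it matches that of all features."""
    target = dependency(rows, features, decision)
    selected = []
    while dependency(rows, selected, decision) < target:
        remaining = [f for f in features if f not in selected]
        best = max(remaining,
                   key=lambda f: dependency(rows, selected + [f], decision))
        selected.append(best)
    return selected

# Toy decision table (illustrative data, not from the project).
toy_rows = [
    {"a": 0, "b": 0, "c": 1, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
    {"a": 1, "b": 0, "c": 0, "d": "no"},
    {"a": 1, "b": 1, "c": 0, "d": "yes"},
]
print(dependency(toy_rows, ["a"], "d"))               # 0.0
print(dependency(toy_rows, ["b"], "d"))               # 1.0
print(greedy_reduct(toy_rows, ["a", "b", "c"], "d"))  # ['b']
```

In this toy table feature b alone determines the decision, so the other two features can be dropped without information loss. In a distributed setting, each partition of features could be evaluated by a function like `dependency` independently, which is the property that makes the parallel split described above possible.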

In part II, based on a mathematical redefinition of the main standard RST concepts (RO1), a decentralised RST feature selection approach was developed to deal with “lost” and “do not care” missing values. These new concepts were adapted to function under the MapReduce model. Results revealed that the proposed approach efficiently performs big data feature selection in an imprecise context without any significant information loss while handling missing values in the most convenient way (RO3). Because of the extra formulations needed to deal with big data veracity, the developed approach was considered computationally expensive. Thus, to achieve RO4, the RO3 approach was extended with the hashing optimisation approach (RO2) so that it delivers high-quality reduced big data with missing values while responding as quickly as possible. Results showed the effectiveness of part II in dealing with big data while preserving data dependency in the distributed environment and ensuring a lower computational cost.
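The distinction between the two kinds of missing values can be sketched with characteristic sets, a formulation commonly used in the rough set literature on incomplete data. Whether the project uses exactly this formulation is an assumption; the markers "?" (lost) and "*" (do not care), the function names, and the toy data are illustrative only. A lost value keeps an object out of every block of that attribute, whereas a "do not care" value places it in all of them.

```python
LOST, DONT_CARE = "?", "*"  # assumed markers for the two missing-value types

def attribute_blocks(rows, attr):
    """Map each known value of `attr` to the set of row indices matching it.
    'Do not care' rows join every block; 'lost' rows join none."""
    known = {r[attr] for r in rows} - {LOST, DONT_CARE}
    blocks = {v: set() for v in known}
    for i, r in enumerate(rows):
        v = r[attr]
        if v == DONT_CARE:
            for members in blocks.values():
                members.add(i)
        elif v != LOST:
            blocks[v].add(i)
    return blocks

def characteristic_set(rows, attrs, x):
    """Rows indistinguishable from row `x` on `attrs`, given what is known.
    Attributes on which `x` itself is missing impose no constraint."""
    result = set(range(len(rows)))
    for a in attrs:
        v = rows[x][a]
        if v not in (LOST, DONT_CARE):
            result &= attribute_blocks(rows, a)[v]
    return result

# Toy incomplete table (illustrative data, not from the project).
patients = [
    {"temp": "high",   "headache": LOST},
    {"temp": DONT_CARE, "headache": "yes"},
    {"temp": "high",   "headache": "yes"},
    {"temp": "normal", "headache": "no"},
]
for x in range(len(patients)):
    print(x, sorted(characteristic_set(patients, ["temp", "headache"], x)))
# 0 [0, 1, 2]
# 1 [1, 2]
# 2 [1, 2]
# 3 [3]
```

Characteristic sets replace the equivalence classes of the complete-data case, so approximations and dependency measures can be defined on top of them in the usual way.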

Part III took the project into practice via secondments at the hospital “Hôtel-Dieu de Paris”. The aim was to ensure that RoSTBiDFramework, with its different merged approaches, meets the necessary real-world requirements and regulations. To assist epidemiologists, RoSTBiDFramework was applied to big data for cancer incidence prediction, which is used in their analytical studies of public interest. Results showed that RoSTBiDFramework is relevant to real-world big data applications: it offers insights into the selected risk factors, speeds up the learning process, maintains the performance of the cancer incidence prediction model, and helps epidemiologists make better decisions because the resulting data set is much easier to interpret. Promising results were also obtained in other applications, with radiologists for breast cancer analysis and with companies for book sales prediction.
This project advances the state of the art in an original and innovative way: (1) It establishes an optimised framework based on RST for big data pre-processing in certain and imprecise contexts, providing foundations for the future development of improved analysis tools for big data mining and feature selection. (2) It develops automated dimensionality reduction techniques that do not require extra information. (3) It deals with the big data veracity aspect by distinguishing two types of missing data, handling them in the most convenient way, and providing the basis for new tools to cover the imprecise space. (4) It uses hashing techniques as an optimised methodology to deliver a fast, accurate, and high-quality system response for the big data feature selection task, with and without the big data veracity aspect.

Outputs of RoSTBiDFramework will add to EU competitiveness and strengthen the scientific and technological bases of its Member States in big data as part of the Digital Agenda for Europe. RoSTBiDFramework is well aligned with the aims of H2020, which identifies big data in its Leadership in Enabling and Industrial Technologies programme as an important area for strengthening Europe’s industrial capacities and business perspectives.

As mentioned, RoSTBiDFramework can be applied in any discipline involving data. Data has become a key asset for both the economy and society, and it spans broad areas such as weather, transport, energy consumption, and health. Using this data has become a crucial way for companies to outperform their peers by leveraging data-driven strategies to innovate, compete, and capture value. Owing to its modularity, RoSTBiDFramework for data pre-processing and knowledge discovery has potential impact in all these areas.