According to multiple analyst estimates, around 80% to 90% of data is unstructured (text, images, audio, video), which means that it is not stored in searchable formats (such as a database). As a consequence, huge pools of unstructured data are now locked away in object stores as bulk data that is very difficult to mine and analyse. Unstructured data management (discovery, collection, mining, filtering and processing) remains a nightmare for data scientists, who have to cope with a proliferation of data sources from heterogeneous domains. Each of these domains has its own domain-specific data manipulation mechanisms, data pipelines, and governance models, with which data scientists must become acquainted.
The NEARDATA project focuses on three health data domains containing large volumes of unstructured data: metabolomics (images), genomics (text), and surgery data (video). The selected data domains match the definition of extreme data closely. First, omics settings face huge and growing data volumes that push current technologies to their limits. The main challenge here is data ingestion from Object Storage into computing analytics services. Second, computer-assisted surgery poses an extreme data speed problem, because it requires low-latency, real-time video analytics and robotic IoT event streams. The challenge here is to support real-time video streams that must be processed and delivered to Object Storage very quickly and at large scale. Finally, health data in general is highly sensitive, so it carries tough privacy and security requirements that prevent many hospitals and research labs from sharing, moving, or publishing their data openly. The big challenge here is to discover and distil meaningful, reliable and useful data from heterogeneous, dispersed and scarce sources in a trustworthy way under stringent confidentiality requirements.
This project is extremely ambitious with regard to its expected impact. First of all, it aims to improve European leadership in the global data economy through innovative data technologies. In particular, the project aims to enable International Data Spaces in three health domains (metabolomics, genomics, surgomics), which will boost innovation around today's complex unstructured data formats in those domains. The project already shows international leadership in metabolomics data, and it is creating innovative technologies for genomics and surgomics.