BigStorage: Storage-based Convergence between HPC and Cloud to handle Big Data

Periodic Reporting for period 2 - BigStorage (BigStorage: Storage-based Convergence between HPC and Cloud to handle Big Data)

Reporting period: 2017-01-01 to 2018-12-31

The consortium of this European Training Network (ETN), “BigStorage: Storage-based Convergence between HPC and Cloud to handle Big Data”, trains future data scientists to apply holistic and interdisciplinary approaches for extracting value from a data-overwhelmed world. Doing so requires HPC and Cloud infrastructures, together with a redefinition of the storage architectures underpinning them, that meet highly ambitious performance and energy-usage objectives.

There has been an explosion of digital data, which is changing our knowledge of the world. The drivers for this data deluge are twofold: enterprises and agencies collecting, processing, and publishing heterogeneous data derived from multiple sources (e.g. sensors, scientific experiments), and citizens publishing content through channels such as social networks and Cloud systems. To gain value from this data, it must be analysed and often combined or compared with simulated and predicted data. This huge data collection, which cannot be managed by current data management systems, is known as Big Data. Techniques to address it are gradually combining with what has traditionally been known as High Performance Computing. This ETN therefore focuses on the convergence of Big Data, HPC, and Cloud data storage, and on its management and analysis.

A number of initiatives and studies, such as PRACE and ETP4HPC, have emphasized the lack of professionals with the skills needed to meet the EC's goal of providing Europe with the necessary ecosystem of technology providers, research infrastructures, and application developers in HPC, Cloud, Storage, Energy, and Big Data to sustain Europe's economy. These reports also stress the need for interdisciplinary training that enables efficient interaction between application domains, numerical analysis, and computer science in the HPC field.

To gain value from Big Data, it must be addressed from many different angles: (i) applications, which exploit this data; (ii) middleware, operating in Cloud and HPC environments; and (iii) infrastructure, which provides the storage and computing capable of handling it.

Big Data can only be effectively exploited if techniques and algorithms are available that help to understand its content, so that it can be processed by decision-making models. This is the main goal of Data Science, a new discipline related to Big Data that incorporates theories and tools from many areas, including statistics, machine learning, visualization, databases, and highly parallelised HPC programming.

We claim that this ETN project is the ideal means to educate new researchers in the different facets of Data Science (spanning storage hardware and software architectures, large-scale distributed systems, data management services, data analysis, machine learning, and decision making). Such multifaceted expertise is mandatory to enable researchers to propose appropriate answers to application requirements, while leveraging advanced data storage solutions that unify Cloud and HPC storage facilities.
The main results achieved so far include:
- Definition of the requirements of four use cases based on key projects in Europe's research agenda: the Human Brain Project, the Square Kilometre Array, climate modelling, and smart cities. All four use cases have challenging requirements in both storage and data analysis, the main topics of the ETN.
- Definition of the state-of-the-art of Data Science.
- Definition of the state-of-the-art of specific techniques used in HPC and Cloud areas.
- Definition of the problems related to device technology and how these affect storage architecture and the storage stack.
- Definition of the state-of-the-art on energy-related research.
- Definition of benchmarks.
- Experimental study on energy drivers in HPC and Cloud computing.
- Proposal and evaluation of models and innovative techniques related to Data Science.
- Proposal and evaluation of models and techniques for Cloud and HPC.
- Proposal and evaluation of models and techniques for storage.
- Study on energy aspects of reducing the storage footprint and potential of application tailored data compression and deduplication techniques.
- Training activities related to the topics of the ETN (4 training schools).
- Organization of the workshop EBDMA, co-located with the prestigious international conference IEEE/ACM CCGrid 2017.
- Organization of the workshop CEBDA, co-located with the prestigious international conference IPDPS 2018.
- Communication activities in different forums.
- 32 publications in international journals, conferences, and workshops by our Early Stage Researchers (ESRs).
The impact of this ETN is multifaceted, producing short- and long-term effects on:
a) Education of ESRs in research methodology, practically applied to data science and supporting technologies, achieving an interdisciplinary understanding of an area of key importance and value to Europe.
b) Excellent prospects for the ESRs within academic and industrial domains across Europe and worldwide, given the increasing relevance of the ETN topic.
c) Creating and extending the collaborative research ecosystem in Big Data, HPC and Storage within Europe, in subject areas with critical future needs.
d) Improving the competitiveness of Europe, particularly through skills developed to better address the needs of Exascale computing and the Big Data deluge.
e) Directly addressing known areas of skills shortage.
f) Improving the international standing of the advisers and associated researchers within their fields.