DATAGRID
Research and Technological Development for an International Data GRID

Abstract
The DataGrid project aims to develop, implement and exploit a large-scale data- and CPU-oriented computational GRID. This new hardware/software infrastructure will allow geographically distributed processing of huge amounts of data coming from three different scientific disciplines: High Energy Physics, Biology and Earth Observation. The project is developing the necessary "middleware" software in collaboration with some of the leading centres of expertise in GRID technology, leveraging practice and experience from previous and current GRID initiatives in Europe and elsewhere. The project, which will last three years, produced its first software release in September 2001. Two more major software releases are foreseen by September 2002 and September 2003. The first testbed was deployed and demonstrated within the first year of activity. This testbed will be enlarged and used as a basis for important extensions and improvements during 2002 and 2003. The project will extend the state of the art in international, large-scale, data-intensive GRID computing, providing a solid base of knowledge and experience for exploitation first by European Research and eventually by Industry and Commerce.
Objectives
The objective of this project is to enable next-generation scientific exploration, which requires intensive computation and analysis of shared large-scale databases. The need to share huge amounts of data and computational resources across widely distributed communities is emerging in many scientific disciplines, including physics, biology, and earth sciences. Such sharing is complicated by the distributed nature of the resources to be used, the dispersed communities of scientists, the size of the databases and the limited network bandwidth. To address these problems the DataGrid project is developing the necessary software components, building on emerging GRID technologies. The project is also implementing testbeds to demonstrate the usability of this software and its compliance with the requirements.
Figure: Organisation of the technical work packages in the DataGrid project
Technical Approach
The DataGrid Project is organised into 12 Work Packages. The first five Work Packages are developing the "middleware" software, based on existing GRID toolkits. Work Package 1 deals with workload scheduling and aims at defining and implementing an architecture for distributed scheduling and resource management. Work Package 2 deals with data management and aims at implementing and comparing different distributed data management approaches, including caching, file replication and file migration. Work Package 3 is specifying, developing, integrating and testing tools and infrastructures to enable access to status and error information in a GRID environment. The objective of Work Package 4 is to develop new automated system management techniques that will enable the deployment of very large computing fabrics constructed from mass-market components, with reduced system administration and operation costs. Work Package 5 has two objectives: first, defining a common user API and data export/import interfaces to the various local mass storage management systems used by project partners; second, publishing details of the available storage systems via GRID information systems.
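The file replication approach mentioned for Work Package 2 can be illustrated with a minimal sketch. It is not the DataGrid middleware or its API: the `ReplicaCatalogue` class, its method names, the site names and the cost figures are all hypothetical, chosen only to show the idea of one logical file name resolving to several physical copies, with the cheapest copy selected for access.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ReplicaCatalogue:
    """Hypothetical catalogue: logical file name -> sites holding a copy."""

    replicas: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, logical_name: str, site: str) -> None:
        # Record the first (or an additional) physical copy of a file.
        self.replicas.setdefault(logical_name, set()).add(site)

    def replicate(self, logical_name: str, target_site: str) -> None:
        # Replication only makes sense for a file that is already known.
        if logical_name not in self.replicas:
            raise KeyError(f"unknown logical file: {logical_name}")
        self.replicas[logical_name].add(target_site)

    def select(self, logical_name: str, cost: Dict[str, float]) -> str:
        # Pick the replica site with the lowest access cost. Sites without
        # a known cost are treated as infinitely expensive; ties are broken
        # alphabetically so the result is deterministic.
        sites = sorted(self.replicas[logical_name])
        return min(sites, key=lambda s: cost.get(s, float("inf")))


catalogue = ReplicaCatalogue()
catalogue.register("lfn:run42.dat", "site-a")      # original copy
catalogue.replicate("lfn:run42.dat", "site-b")     # second copy elsewhere
best = catalogue.select("lfn:run42.dat", {"site-a": 5.0, "site-b": 1.0})
print(best)
```

In this sketch the cost table stands in for whatever network or storage metric a real system would use; how such costs are measured is outside the scope of the example.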
Testbed
Work Package 6 (Testbed) and Work Package 7 (Network) deal with planning, organising and operating the testbeds used to demonstrate and test the data- and computing-intensive GRID in production-quality operation over high-performance networks. Work Package 6 integrates successive releases of the software packages and makes these available for installation at the different testbed sites. It also plans and implements a series of progressively larger and more demanding demonstrations. Work Package 7 deals with the network aspects of the project and aims at providing GRID network management, ensuring the agreed quality of service, and providing information on network performance and reliability. The DataGrid project does not itself provide the network service but works in close collaboration with other European projects such as Geant.
Applications
CERN's next particle accelerator, the Large Hadron Collider (LHC), will be operational within a few years. The sheer computational capacity required to analyse the data produced by its detectors implies that the analysis must be performed at geographically distributed centres. DataGrid has to demonstrate that the technology can be used to implement and effectively operate an integrated computing and data access service for the LHC experiments in an internationally distributed environment.
The Earth Observation community collects data at distributed stations and maintains databases at different locations. DataGrid deploys the middleware to provide uniform access to these large, distributed archives as if they were a single facility.
Molecular biology and genetics research use a large number of independent databases, and there are several national projects that aim at providing integrated access to these databases. DataGrid provides middleware for the interconnection of these national testbeds.
Innovation
The project will produce a novel environment able to support globally distributed scientific exploration involving multi-PetaByte datasets. The project will devise and develop middleware solutions and testbeds capable of scaling to handle many PetaBytes of distributed data, tens of thousands of resources (processors, disks, ...), and thousands of simultaneous users. The project will produce this advanced and innovative environment by combining and extending newly emerging GRID technologies. A consequence of this project will be the diffusion of new strategies for scientific exploration, as access to fundamental scientific data will no longer be restricted to the producer of that data. While the project focuses on scientific applications, issues of sharing data and computing power are germane to many applications, and thus the project has a potential impact on future industrial and commercial activities.
- Project name: DataGrid
- Contract no: IST-2000-25182
- Project type: RTD
- Start date: 01/01/2001
- Duration: 36 months
- Total budget: €12,820,000
- Funding from the EC: €9,227,506
- Total effort in person-months: 3907
- Website: http://www.eu-datagrid.org
- Contact person: Dr. Fabrizio Gagliardi
  - Fabrizio.Gagliardi@cern.ch
  - tel: +41-22-7672374
  - fax: +41-22-7677155
- Project participants:
  - CEA - FR
  - CERN - International
  - CESNET - CZ
  - CNR - IT
  - CNRS - FR
  - CS SI - FR
  - Datamat - IT
  - ESA-ESRIN - IT
  - EVG HEI UNI - D
  - FOM - NL
  - IBM - UK
  - IFAE - ES
  - INFN - IT
  - ITC-IRST - IT
  - KNMI - NL
  - MTA-SZTAKI - HU
  - NFR - S
  - PPARC - UK
  - SARA - NL
  - UH - SF
  - VR - S
  - ZIB - D
- Keywords:
  - Information processing
  - Information systems
  - Scientific research
  - Telecommunication
- Collaboration with other EC funded projects:
  - CrossGrid
  - Damien
  - DataTAG
  - Eurogrid
  - Geant
  - GridSTART