CORDIS - EU research results

Content archived on 2023-03-02

HPC4U Open Source Version Released

The HPC4U European research project, active in Grid computing technologies, has just released the first freeware version of its Grid middleware, which provides fault tolerance for parallel applications.

This system, based on a Linux kernel running as an MS Windows service (coLinux), lets users launch parallel applications on virtual nodes in order to see the fault tolerance mechanisms in action. For example, a user can start a parallel compute job on two compute nodes, kill one of these nodes and see, within a second, the job restart on two other nodes.

Freeware Stack: an innovative service tailored for computing performance/optimisation

HPC4U’s freeware version uses a coLinux system. coLinux is a virtualisation solution which, in contrast to other systems such as VMware, does not emulate an entire machine but runs the Linux kernel as an MS Windows service. It is easy to run, since the operating system is booted from a CD-ROM or DVD without any prior installation on the computer's disk.

This coLinux-based system uses CCS and two free, open source components that offer basic fault tolerance mechanisms for parallel applications: BLCR (Berkeley Lab Checkpoint/Restart) and LAM/MPI.

BLCR allows programs running on Linux to be "checkpointed" (written entirely to a file) and later "restarted". BLCR performs checkpointing and restarting inside the Linux kernel. While this makes it less portable than solutions that use user-level libraries, it also means that BLCR has full access to all kernel resources and can thus restore resources (such as process IDs) that user-level libraries cannot. In the future, this will also allow BLCR to checkpoint and restart entire sessions and process groups (such as shell scripts and their subprocesses).

LAM/MPI is an open source implementation of the Message Passing Interface specification, including all of MPI-1.2 and much of MPI-2. One of the main advantages of using LAM/MPI in the HPC4U freeware bundle is its native compatibility with BLCR.

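The checkpoint/restart idea described above can be illustrated with a minimal Python sketch. It is only an analogy and rests on several assumptions: it saves state at application level with `pickle`, whereas BLCR works inside the Linux kernel and needs no cooperation from the program; it does not call any BLCR or LAM/MPI interface; and the names `run_job`, `demo`, `fail_at` and `job.ckpt` are hypothetical, introduced purely for illustration.

```python
import os
import pickle
import tempfile


def run_job(checkpoint_path, total_steps, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing after every step.

    If a checkpoint file already exists, the job resumes from it instead
    of starting over. `fail_at` is a hypothetical knob that simulates a
    compute node being killed just before the given step.
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)  # "restart": reload the saved state

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        # "Checkpoint": write to a temporary file, then rename it, so an
        # interrupted write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


def demo():
    """Run a job that fails part-way, then restart it from its checkpoint."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "job.ckpt")
        try:
            run_job(path, 10, fail_at=6)  # dies after completing steps 0..5
        except RuntimeError:
            pass
        return run_job(path, 10)  # resumes at step 6 and finishes
```

Calling `demo()` returns the same result as an uninterrupted run, because the second call picks up from the last saved step rather than recomputing the first six. In the HPC4U bundle the equivalent work is done transparently for whole MPI processes, without the application having to save its own state.
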
As detailed on the LAM/MPI website, MPI applications running under LAM/MPI can be checkpointed to disk and restarted at a later time. LAM/MPI requires a third-party single-process checkpoint/restart toolkit to actually checkpoint and restart each individual MPI process, while LAM/MPI itself handles the parallel coordination.

The combination of these free, open source components with CCS, the Resource Management System developed by UPB and used by HPC4U, makes it possible to test the basic functionalities of HPC4U. Users simply boot one or more computers from the provided DVD, temporarily turning them into compute nodes, and can then test the fault tolerance mechanisms on a given application.

About HPC4U

Grid computing is an established tool in academic research and is used in numerous national and international research projects. Industry is now starting to acknowledge the potential of Grid computing. However, the importance of qualities such as reliability, transparency and Quality of Service (QoS) still needs to be officially recognised as a major requirement for implementing future Grids at a commercial level.

The main objective of the HPC4U project (Highly Predictable Cluster for Internet Grids) is to provide an application-transparent, software-only solution for a reliable Resource Management System. It will allow the Grid to negotiate Service Level Agreements (SLAs), and it will feature mechanisms such as process and storage checkpointing to achieve fault tolerance and ensure adherence to the agreed SLAs. The HPC4U solution will act as an active Grid component, using available Grid resources to further improve its level of fault tolerance.

For more information: www.hpc4u.eu

1. BLCR: http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
2. LAM/MPI: http://www.lam-mpi.org/

Countries

Austria, Belgium, Czechia, Germany, Denmark, Estonia, Greece, Spain, Finland, France, Hungary, Ireland, Italy, Lithuania, Luxembourg, Latvia, Malta, Netherlands, Poland, Portugal, Sweden, Slovenia, Slovakia, United Kingdom