
Algorithmics for Metagenomics on Grids

Final Report Summary - METAGENOGRIDS (Algorithmics for Metagenomics on Grids)

The initial objective of this project was to address the problems of metagenomics assembly and metagenomics annotation on grid computing platforms. A preliminary study showed that algorithmic solutions to metagenomics annotation were already known. This result was obtained by changing the point of view on the problem and showing its similarity to a problem of steady-state optimization of request processing. The three partners involved were not able to initiate a successful collaboration on metagenomics assembly. It was thus decided to refocus the project and to propose generic algorithmic tools applicable to a wide range of bioinformatics applications.

Virtualization makes it possible to hide most architectural and system characteristics from users. It is therefore appealing to users of distributed computing who are not computer scientists. This is especially true of bioinformaticians, who have huge computational needs. We therefore addressed the problem of the efficient use of virtual machines at large scale. More specifically, we proposed an alternative to batch scheduling for managing resource allocation on computational clusters, one that takes advantage of the capabilities offered by virtual machines. The aim of this new task was to enhance platform utilization (and thus to use these platforms more efficiently) while ensuring some Quality of Service to users (which is not done in today's systems).

After an initial study of the complexity of this problem, we first addressed resource allocation for applications having constant resource consumption rates and infinite execution times. This unrealistic scenario served as a theoretical framework in which to acquire knowledge and intuition. We then moved to the study of applications that still have constant resource consumption rates, but that are submitted over time (online scenario) and whose execution times are unknown (non-clairvoyant scenario).
In this context, we designed several job scheduling algorithms. We presented simulation results for synthetic and real-world High Performance Computing (HPC) workloads, in which we compared our proposed algorithms with standard batch scheduling algorithms. We found that our approach provides drastic improvements over batch scheduling in terms of max-stretch minimization, and significant improvements in platform utilization. In particular, we identified a few promising algorithms that perform well across most experimental scenarios. Our results demonstrate that virtualization technology coupled with lightweight scheduling strategies affords dramatic improvements in Quality of Service for HPC workloads while using fewer resources. These results hold great promise for one day designing resource management systems that deliver significantly better platform utilization than current systems while ensuring some Quality of Service. To reach such a goal, the study should be extended to applications whose resource consumption rates change during execution. However ambitious this goal is, the foundations laid in this study appear strong enough to make it achievable.
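
To make the max-stretch metric concrete, here is a minimal sketch. It assumes the usual definition of stretch: the time a job spends in the system divided by the time it would need on dedicated resources. It compares two toy policies on a single node: run-to-completion first come, first served (a stand-in for batch scheduling) and equal fractional sharing (a stand-in for what virtual machines make possible). The names `Job`, `fcfs_completions`, `equal_share_completions` and `max_stretch` are hypothetical, and neither policy is one of the algorithms designed in the project. Virtualization overhead and memory constraints are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    release: float  # submission time
    work: float     # execution time on a dedicated node (assumed > 0)


def fcfs_completions(jobs):
    """Completion times when jobs run one at a time, in release order."""
    completions = [0.0] * len(jobs)
    t = 0.0
    for i in sorted(range(len(jobs)), key=lambda i: jobs[i].release):
        t = max(t, jobs[i].release) + jobs[i].work
        completions[i] = t
    return completions


def equal_share_completions(jobs):
    """Completion times when all present jobs share the node equally."""
    order = sorted(range(len(jobs)), key=lambda i: jobs[i].release)
    remaining = {}                      # job index -> work left
    completions = [0.0] * len(jobs)
    t, k = 0.0, 0                       # current time, next job to release
    while k < len(jobs) or remaining:
        if not remaining:               # idle node: jump to next release
            t = max(t, jobs[order[k]].release)
        while k < len(jobs) and jobs[order[k]].release <= t:
            remaining[order[k]] = jobs[order[k]].work
            k += 1
        share = 1.0 / len(remaining)    # fraction of the node per job
        first = min(remaining, key=remaining.get)
        finish_at = t + remaining[first] / share
        arrival_at = jobs[order[k]].release if k < len(jobs) else float("inf")
        now = min(finish_at, arrival_at)
        for i in remaining:             # every job progresses at rate `share`
            remaining[i] = max(0.0, remaining[i] - (now - t) * share)
        t = now
        if finish_at <= arrival_at:     # next event is a completion
            completions[first] = t
            del remaining[first]
    return completions


def max_stretch(jobs, completions):
    """Largest (completion - release) / work over all jobs."""
    return max((c - j.release) / j.work for j, c in zip(jobs, completions))


# A long job followed shortly by a short one.
jobs = [Job(release=0.0, work=100.0), Job(release=1.0, work=1.0)]
print(max_stretch(jobs, fcfs_completions(jobs)))         # 100.0
print(max_stretch(jobs, equal_share_completions(jobs)))  # 2.0
```

In this toy case the short job waits behind the long one under run-to-completion scheduling, whereas fractional sharing lets it finish almost immediately at a small cost to the long job. This only illustrates why max-stretch is sensitive to the scheduling policy; it does not reproduce the reported simulation results.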

As a side project, we considered bag-of-tasks applications, which are especially important in bioinformatics because they include, for example, parameter sweep applications. We studied a problem of steady-state optimization of bag-of-tasks applications in a probabilistic context (a non-clairvoyant setting in which task computation and communication times are defined by probability distributions). To the best of our knowledge, this is the first work on steady-state scheduling in a probabilistic context. It shows that, in such a context, static approaches can still be designed, and that they still deliver better performance than system-like approaches, even in a non-clairvoyant setting.
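
As a rough illustration of what steady-state reasoning looks like when task times are random, the sketch below estimates the long-run throughput of a set of independent workers. It rests on strong assumptions that are not taken from the project: task times are exponentially distributed, communication times are neglected, and each worker always has a task available. Under these assumptions the long-run throughput of a worker is the inverse of its mean task time, which the simulation checks. The function names are hypothetical.

```python
import random


def expected_throughput(mean_times):
    """Long-run tasks per time unit: sum of 1 / mean task time per worker."""
    return sum(1.0 / m for m in mean_times)


def simulated_throughput(mean_times, horizon, seed=0):
    """Count tasks completed by `horizon` when task times are exponential."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    completed = 0
    for mean in mean_times:
        t = 0.0
        while True:
            # expovariate takes the rate, i.e. the inverse of the mean
            t += rng.expovariate(1.0 / mean)
            if t > horizon:
                break
            completed += 1
    return completed / horizon


means = [1.0, 2.0, 4.0]                 # hypothetical mean task times
print(expected_throughput(means))       # 1.75
print(simulated_throughput(means, horizon=20000.0))
```

The simulated value should approach the analytic one as the horizon grows. The point is only that a static, average-based view can remain meaningful when individual task times are unknown in advance; the study itself also accounts for communication times.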