Distributed Checkpointing on Clusters with Dynamic Striping and Staggering

Hai Jin (5) and Kai Hwang (6)

(5)  Huazhong University of Science and Technology, 430074 Wuhan, China
(6)  University of Southern California, 90007 Los Angeles, USA
Abstract
This paper presents a new striped and staggered checkpointing (SSC) scheme for multicomputer clusters. We consider serverless clusters, where local disks attached to cluster nodes collectively form a distributed RAID (redundant array of inexpensive disks) with a single I/O space. The distributed RAID is used to save the checkpoint files periodically. Striping enables parallel I/O on distributed disks. Staggering avoids the network bottleneck in distributed disk I/O operations. For a fixed cluster size, we reveal the tradeoffs between these two speedup techniques. Our SSC approach allows dynamic reconfiguration to minimize message-logging requirements among concurrent software processes. We demonstrate how to reduce the checkpointing overhead by striping and staggering dynamically. For communication-intensive programs, our SSC scheme can significantly reduce the checkpointing overhead. Benchmark results demonstrate the benefits of trading off stripe parallelism against distributed staggering. These results are useful for designing efficient checkpointing schemes for fast rollback recovery from any single node (disk) failure in a cluster of computers.
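
The tradeoff between striping and staggering for a fixed cluster size can be illustrated with a small sketch. The abstract does not spell out the algorithm, so everything below is an assumption made for illustration: the function names `ssc_schedule` and `stripe_blocks`, the rule that nodes are split into equal-sized groups, and the round-robin block placement are hypothetical and are not taken from the paper. The sketch assumes that nodes in the same group write their checkpoints concurrently (stripe parallelism) while groups take turns in successive time slots (staggering), so a wider stripe means fewer staggered slots, and vice versa.

```python
def ssc_schedule(num_nodes, stripe_width):
    """Split nodes into staggered time slots (hypothetical helper).

    Assumption: each slot holds `stripe_width` nodes that checkpoint
    concurrently; slots are executed one after another.
    """
    if stripe_width <= 0 or num_nodes % stripe_width != 0:
        raise ValueError("stripe_width must be a positive divisor of num_nodes")
    stagger_depth = num_nodes // stripe_width  # number of sequential slots
    return [
        list(range(slot * stripe_width, (slot + 1) * stripe_width))
        for slot in range(stagger_depth)
    ]


def stripe_blocks(checkpoint, disks, block_size):
    """Place fixed-size blocks of one checkpoint round-robin across disks.

    Assumption: plain round-robin striping with no parity; a real
    distributed RAID would also store redundancy for recovery.
    """
    placement = {disk: [] for disk in disks}
    for offset in range(0, len(checkpoint), block_size):
        disk = disks[(offset // block_size) % len(disks)]
        placement[disk].append(checkpoint[offset:offset + block_size])
    return placement


if __name__ == "__main__":
    # 12 nodes, stripe width 4: three staggered slots of four nodes each.
    for slot, group in enumerate(ssc_schedule(12, 4)):
        print(f"slot {slot}: nodes {group}")
    # Stripe an 8-byte checkpoint across two disks in 2-byte blocks.
    print(stripe_blocks(b"abcdefgh", [0, 1], 2))
```

With 12 nodes, choosing a stripe width of 4 gives 3 staggered slots, while a stripe width of 6 gives only 2. This is the kind of trade between parallel disk I/O and reduced network contention that the abstract describes, under the assumptions stated above.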

Contact Information Hai Jin
Email: hjin@hust.edu.cn