Parallel Programming Laboratory

Performance Evaluation of Automatic Checkpoint-based Fault Tolerance for AMPI and Charm++

Gengbin Zheng | Chao Huang | Laxmikant Kale

Operating and Runtime Systems for High-end Computing Systems 2006

Publication Type: Paper

Repository URL: ftCompare

Abstract

As the size of high-performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are an effective way of dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is usually required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Charm++ and Adaptive MPI (AMPI) run-time and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance.
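The disk-based idea in the abstract can be illustrated with a minimal, single-process sketch. It is not the Charm++ or AMPI API: the `Worker` class, the `pack`/`unpack` methods, the JSON file format, and the round-robin placement are all hypothetical choices made for illustration. The sketch only shows the general pattern of objects that can serialize their own state, a checkpoint written to storage, and a restart that maps the restored objects onto a different number of processors.

```python
import json
import os
import tempfile


class Worker:
    """Hypothetical migratable object: everything needed to resume is in pack()."""

    def __init__(self, index, iteration=0, total=0):
        self.index = index
        self.iteration = iteration
        self.total = total

    def step(self):
        # One unit of application work.
        self.iteration += 1
        self.total += self.index * self.iteration

    def pack(self):
        # Serialize the object's full state.
        return {"index": self.index, "iteration": self.iteration, "total": self.total}

    @classmethod
    def unpack(cls, state):
        # Rebuild an object from its serialized state.
        return cls(state["index"], state["iteration"], state["total"])


def checkpoint(workers, path):
    """Write every object's packed state to one file, replacing it atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump([w.pack() for w in workers], f)
    os.replace(tmp, path)


def restart(path, num_procs):
    """Rebuild the objects and place them round-robin on num_procs processors."""
    with open(path) as f:
        states = json.load(f)
    placement = {p: [] for p in range(num_procs)}
    for state in states:
        w = Worker.unpack(state)
        placement[w.index % num_procs].append(w)
    return placement


def run_demo():
    workers = [Worker(i) for i in range(8)]  # 8 objects, notionally on 4 processors
    for _ in range(3):
        for w in workers:
            w.step()
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        checkpoint(workers, path)
        # Simulated failure: progress made after the checkpoint is lost.
        for w in workers:
            w.step()
        # Restart on 3 processors instead of 4.
        return restart(path, num_procs=3)
```

Because the checkpoint stores per-object state rather than per-processor state, the restart step is free to choose any processor count; this is the property that a design built on migratable objects makes possible. A memory-based variant would keep the packed states in the memory of other processors instead of writing a file.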

TextRef

Gengbin Zheng, Chao Huang, and Laxmikant V. Kale, "Performance Evaluation of Automatic Checkpoint-based Fault Tolerance for AMPI and Charm++", ACM SIGOPS Operating Systems Review: Operating and Runtime Systems for High-end Computing Systems, vol. 40, April 2006.

Research Areas

Fault Tolerance Support
