Efficient, Language-Based Checkpointing for Massively Parallel Programs
PPL Technical Report 1995
Publication Type: Paper
Checkpointing and restart is an approach to ensuring forward progress of a program in spite of system failures or planned interruptions. We investigate issues in checkpointing and restart of programs running on massively parallel computers. We identify a new set of issues that must be considered for the MPP platform and, based on them, design an approach built on the language and run-time system. As a result, our checkpointing facility can be used portably on virtually any parallel machine, irrespective of whether the operating system supports checkpointing. We present methods to make checkpointing and restart space- and time-efficient, including object-specific functions that save the state of an object. We present techniques to automatically generate checkpointing code for parallel objects, without programmer intervention. We also present mechanisms that let the programmer selectively incorporate application-specific knowledge to make checkpointing more efficient. The techniques developed here have been implemented in the Charm++ parallel object-oriented programming language and run-time system. Performance results are presented for the checkpointing overhead of programs running on parallel machines.
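To make the idea of object-specific save functions concrete, the following is a minimal sketch in plain C++. It is not the Charm++ API and not the code from the report: the names `Buffer`, `put`, `get`, `Checkpointable`, and `Grid` are hypothetical and chosen only for illustration. The sketch assumes a checkpoint is a flat byte buffer written and read on the same machine architecture, and it shows how application knowledge can shrink a checkpoint by skipping state that can be recomputed on restart.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical byte buffer that holds one checkpoint image.
using Buffer = std::vector<std::uint8_t>;

// Append the raw bytes of a trivially copyable value to the buffer.
template <typename T>
void put(Buffer& buf, const T& value) {
    const auto* p = reinterpret_cast<const std::uint8_t*>(&value);
    buf.insert(buf.end(), p, p + sizeof(T));
}

// Read a value back from the buffer, advancing the cursor.
template <typename T>
T get(const Buffer& buf, std::size_t& pos) {
    assert(pos + sizeof(T) <= buf.size());  // guard against truncated images
    T value;
    std::memcpy(&value, buf.data() + pos, sizeof(T));
    pos += sizeof(T);
    return value;
}

// Hypothetical interface: each object knows how to save and restore itself.
struct Checkpointable {
    virtual ~Checkpointable() = default;
    virtual void save(Buffer& buf) const = 0;
    virtual void restore(const Buffer& buf, std::size_t& pos) = 0;
};

// Example object: only `step` and `cells` are essential state.
// `cachedSum` is derived data, so it is recomputed on restore, not saved.
class Grid : public Checkpointable {
public:
    int step = 0;
    std::vector<double> cells;
    double cachedSum = 0.0;

    void recompute() {
        cachedSum = 0.0;
        for (double c : cells) cachedSum += c;
    }

    void save(Buffer& buf) const override {
        put(buf, step);
        put(buf, static_cast<std::uint64_t>(cells.size()));
        for (double c : cells) put(buf, c);
    }

    void restore(const Buffer& buf, std::size_t& pos) override {
        step = get<int>(buf, pos);
        const auto n = get<std::uint64_t>(buf, pos);
        cells.resize(static_cast<std::size_t>(n));
        for (double& c : cells) c = get<double>(buf, pos);
        recompute();  // rebuild derived state instead of reading it
    }
};
```

A caller would invoke `save` on each object to build the checkpoint image and, after a restart, call `restore` on freshly constructed objects in the same order. Omitting `cachedSum` from the image is the kind of selective, application-specific saving the abstract refers to; a generated default would have to save every field.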
Sanjeev Krishnan and L. V. Kale, "Efficient, Language-Based Checkpointing for Massively Parallel Programs", Parallel Programming Laboratory, Department of Computer Science, University of Illinois, Urbana-Champaign, January 1995.