Scalable Fault Tolerance Schemes using Adaptive Runtime Support
Joint Laboratory for Petascale Computing Workshop (JLPC) 2009
Publication Type: Talk
HPC systems for Computational Science and Engineering have almost reached the threshold where some form of fault tolerance becomes mandatory. Although system-level checkpoint-restart keeps things simple for the application developer, it incurs high overhead. Application-level schemes, meanwhile, are effort-intensive for the programmer. Schemes based on smart runtime systems appear to be at the right level for addressing fault tolerance. Our work, based on object-level virtualization and implemented in the Charm++ runtime system, supports such schemes. Charm++ offers a series of techniques that help tolerate faults in large parallel systems: distributed checkpoints, message logging with parallel recovery, and proactive object migration. Combined with the measurement-based load balancing facilities available in Charm++, these techniques allow an application both to tolerate faults and to continue executing on the remaining resources with optimal efficiency. They can also be applied to MPI applications running under AMPI, an MPI implementation based on Charm++.