Parallel Programming Laboratory

Scalable Fault Tolerance Schemes using Adaptive Runtime Support

Eric Bohm

Joint Laboratory for Petascale Computing Workshop (JLPC) 2009

Publication Type: Talk

Summary

HPC systems for Computational Science and Engineering have almost reached the threshold where some form of fault tolerance becomes mandatory. Although system-level checkpoint-restart keeps things simple for the application developer, it incurs high overhead. Application-level schemes, meanwhile, are effort-intensive for the programmer. Schemes based on smart runtime systems appear to be at the right level for addressing fault tolerance. Our work, based on object-level virtualization and implemented in the Charm++ runtime system, supports such schemes. Charm++ offers several techniques that help tolerate faults in large parallel systems: distributed checkpoints, message logging with parallel recovery, and proactive object migration. When combined with the measurement-based load balancing facilities available in Charm++, these techniques let an application both tolerate faults and continue execution on the remaining resources with high efficiency. They can also be applied to MPI applications running under AMPI, an MPI implementation based on Charm++.
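As a rough illustration of how a distributed checkpoint can combine with load rebalancing after a failure, the following toy simulation may help. It is a minimal sketch and not the Charm++ API: the `Node` class and the `checkpoint` and `recover` functions are hypothetical names, and the buddy placement (each node's copy stored on the next rank) and the "least-loaded survivor" rule are assumptions made for the example, standing in for the runtime's measurement-based load balancer.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical processor holding migratable objects and checkpoint copies."""
    rank: int
    objects: dict = field(default_factory=dict)     # object id -> current state
    local_ckpt: dict = field(default_factory=dict)  # own objects at last checkpoint
    buddy_ckpt: dict = field(default_factory=dict)  # previous rank's objects at last checkpoint


def checkpoint(nodes):
    """Save every node's objects locally and on its buddy (next rank, wrapping)."""
    ranks = sorted(nodes)
    for i, rank in enumerate(ranks):
        buddy = nodes[ranks[(i + 1) % len(ranks)]]
        nodes[rank].local_ckpt = copy.deepcopy(nodes[rank].objects)
        buddy.buddy_ckpt = copy.deepcopy(nodes[rank].objects)


def recover(nodes, failed_rank):
    """Remove the failed node, roll survivors back, and redistribute lost objects.

    Assumes at least two nodes and a single failure since the last checkpoint.
    A fresh checkpoint should be taken afterwards, because the stored copies
    no longer match the new object placement.
    """
    ranks = sorted(nodes)
    buddy_rank = ranks[(ranks.index(failed_rank) + 1) % len(ranks)]
    lost = copy.deepcopy(nodes[buddy_rank].buddy_ckpt)  # copy held by the buddy
    del nodes[failed_rank]

    # Survivors roll back to their own checkpointed state.
    for node in nodes.values():
        node.objects = copy.deepcopy(node.local_ckpt)

    # Place each lost object on the currently least-loaded survivor.
    for obj_id, state in sorted(lost.items()):
        target = min(nodes.values(), key=lambda n: (len(n.objects), n.rank))
        target.objects[obj_id] = state
    return nodes
```

Because the work is divided into many objects rather than one process per processor, the objects of a failed node can be spread across the survivors instead of overloading a single replacement, which is the intuition behind continuing on the remaining resources.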

People

Eric Bohm

Research Areas
