ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection
International Conference for High Performance Computing, Networking, Storage and Analysis (SC) 2013
Publication Type: Paper
Repository URL: papers/201212_ReplicaFT
As machines increase in scale, many researchers have predicted that failure rates will correspondingly increase. Soft errors do not inhibit execution, but may silently generate incorrect results. Recent trends have shown that soft error rates are increasing, and hence they must be detected and handled to maintain correctness. We present a holistic methodology for automatically detecting and recovering from soft or hard faults with minimal application intervention. This is demonstrated by ACR: an automatic checkpoint/restart framework that performs application replication and automatically adapts the checkpoint period using online information about the current failure rate. ACR performs an application- and user-oblivious recovery. We empirically test ACR by injecting failures that follow different distributions for five applications and show low overhead when scaled to 131,072 cores. We also analyze the interaction between soft and hard errors and propose three recovery schemes that explore the trade-off between performance and reliability requirements.
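The idea of adapting the checkpoint period to the observed failure rate can be illustrated with a minimal sketch. The abstract does not state which model ACR uses, so this example substitutes Young's well-known first-order approximation, period = sqrt(2 * C * M), where C is the time to write one checkpoint and M is the mean time between failures. The function names, the simple running-average estimator, and the fallback default are all assumptions made for illustration and are not taken from ACR.

```python
import math


def estimate_mtbf(failure_count, elapsed_seconds):
    """Hypothetical online estimator: elapsed run time divided by failures seen.

    Returns None until at least one failure has been observed.
    """
    if failure_count <= 0:
        return None
    return elapsed_seconds / failure_count


def checkpoint_period(checkpoint_cost, mtbf):
    """Young's first-order approximation: sqrt(2 * C * M).

    This is a stand-in model, not necessarily the one ACR uses.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def adapt_period(failure_count, elapsed_seconds, checkpoint_cost, default_period):
    """Recompute the period from the current failure-rate estimate.

    Falls back to default_period (an assumed parameter) when no failure
    has been observed yet.
    """
    mtbf = estimate_mtbf(failure_count, elapsed_seconds)
    if mtbf is None:
        return default_period
    return checkpoint_period(checkpoint_cost, mtbf)
```

In this sketch, a higher observed failure rate lowers the estimated mean time between failures and therefore shortens the checkpoint period, while a more expensive checkpoint lengthens it.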
Research Areas