
Adaptive Runtime Support for Fault Tolerance
PPL Talk 2009
Publication Type: Talk
Presented at Los Alamos Computer Science Symposium 2009, Santa Fe, NM

Supercomputers have grown exponentially in size over the last two decades. At this rate, exascale machines are expected in the 2018-2022 timeframe. Bringing about a productive exascale environment, however, requires addressing several key challenges, one of which is fault tolerance. Machines at extreme scale will experience frequent failures, and the system will have to avoid or overcome them. Various techniques have recently been developed to tolerate failures. The impact and scalability of these techniques can be substantially enhanced by a parallel programming model called migratable objects. In this talk, we demonstrate how the migratable-objects model facilitates and improves several fault tolerance approaches. Our experimental results on hundreds of cores suggest that fault tolerance schemes based on migratable objects have low performance overhead and high scalability.
Research Areas