Proactive Fault Tolerance in Large Systems
    
    Workshop on High Performance Computing Reliability Issues at HPCA (HPCRI) 2005
    Publication Type: Paper
    Abstract
    High-performance systems with thousands of processors have been
introduced in the recent past, and systems with hundreds of
thousands of processors should become available in the near future.
Since failures are likely to be frequent in such systems, schemes
for dealing with faults are important. In this paper, we introduce
a new fault tolerance solution for parallel applications that
proactively migrates execution from a processor where a failure is
imminent. Our approach assumes that some failures are predictable,
and leverages the fact that current hardware devices contain
various features supporting early indication of faults. By using
the concepts of processor virtualization in Charm++ and Adaptive
MPI (AMPI), we describe a mechanism that migrates objects when a
failure is expected to arise in a given processor, without
requiring spare processors. After the objects are migrated and a
load balancing scheme is applied, the MPI application can continue
executing efficiently on the remaining processors. We modify the
implementation of collective operations, such as reductions, so
that they continue to operate efficiently even after a processor is
evacuated and crashes. To demonstrate the feasibility of our
approach, we present preliminary performance data.
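
The evacuation idea summarized above can be illustrated with a minimal sketch. This is not the Charm++ or AMPI API, and it is not the paper's implementation: the function names `evacuate` and `reduction_tree`, the greedy least-loaded placement, and the binary reduction tree are all assumptions chosen only to make the two steps concrete (moving objects off a processor that is expected to fail, and rebuilding the reduction tree so it no longer passes through that processor).

```python
import heapq


def evacuate(placement, loads, processors, failing):
    """Return a new object-to-processor map with nothing left on `failing`.

    placement:  dict mapping object id -> processor id (hypothetical layout)
    loads:      dict mapping object id -> measured load of that object
    processors: list of all processor ids
    failing:    processor id for which a failure has been predicted
    """
    survivors = [p for p in processors if p != failing]
    if not survivors:
        raise ValueError("no surviving processor to migrate to")

    # Load already present on each surviving processor.
    proc_load = {p: 0.0 for p in survivors}
    for obj, proc in placement.items():
        if proc != failing:
            proc_load[proc] += loads[obj]

    # Min-heap keyed by load, so the least-loaded survivor is popped first.
    heap = [(load, proc) for proc, load in proc_load.items()]
    heapq.heapify(heap)

    new_placement = dict(placement)
    # Greedy assumption: place the heaviest evacuated objects first.
    evacuees = sorted(
        (obj for obj, proc in placement.items() if proc == failing),
        key=lambda obj: -loads[obj],
    )
    for obj in evacuees:
        load, proc = heapq.heappop(heap)
        new_placement[obj] = proc
        heapq.heappush(heap, (load + loads[obj], proc))
    return new_placement


def reduction_tree(processors, failing):
    """Map each surviving processor to its parent in a binary reduction tree.

    The tree is built over survivors only, so no reduction message is
    routed through the evacuated processor. The root has parent None.
    """
    survivors = [p for p in processors if p != failing]
    return {
        proc: (survivors[(i - 1) // 2] if i > 0 else None)
        for i, proc in enumerate(survivors)
    }


if __name__ == "__main__":
    placement = {"a": 0, "b": 1, "c": 1, "d": 2, "e": 3}
    loads = {"a": 4, "b": 3, "c": 1, "d": 1, "e": 2}
    print(evacuate(placement, loads, [0, 1, 2, 3], failing=1))
    print(reduction_tree([0, 1, 2, 3], failing=1))
```

In this sketch no spare processor is needed: the objects from the failing processor are spread over the processors that remain, and a real runtime would typically follow the evacuation with a full load balancing step rather than rely on the greedy placement alone.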
    TextRef
      
        Sayantan Chakravorty, Celso Mendes and L. V. Kale,
"Proactive Fault Tolerance in Large Systems",
HPCRI Workshop in conjunction with HPCA 2005, 2005.
      