JUNE 18–22, 2017
FRANKFURT AM MAIN, GERMANY

Presentation Details

 
Name: Characterizing Faults, Errors & Failures in Extreme-Scale Computing Systems
 
Time: Wednesday, June 21, 2017
01:45 pm - 02:05 pm
 
Room: Panorama 1
Messe Frankfurt
 
Breaks: 12:30 pm - 01:45 pm Lunch
 
Speaker: Christian Engelmann, ORNL
 
Abstract: The path to exascale computing poses several research challenges. Resilience, i.e., providing efficiency and correctness in the presence of faults, errors and failures, is one of the most important of these challenges, because systems scale up in component count while component reliability does not increase accordingly. This talk provides an overview of recent and ongoing resilience research activities at Oak Ridge National Laboratory, Argonne National Laboratory and Lawrence Livermore National Laboratory aimed at developing the missing high-performance computing (HPC) fault model. This effort identifies, categorizes and models the fault, error and failure properties of today's HPC systems. It develops a taxonomy, a catalog and models that capture the observed and inferred conditions in current systems, and it extrapolates this knowledge to exascale HPC systems.