JUNE 18–22, 2017
FRANKFURT AM MAIN, GERMANY

Session Details

 
Name: Fault Tolerance for Next Generation High Performance Computing
 
Time: Wednesday, June 21, 2017
01:45 pm - 03:15 pm
 
Room:   Panorama 1
Messe Frankfurt
 
Breaks:   03:15 pm - 03:45 pm Coffee Break
 
Chair:   Franck Cappello, ANL
 
Abstract:   Most scientific applications running at large scale include checkpoint/restart code to tolerate HPC system failures and to extend execution beyond the limit of the time allocation. The evolution of HPC system characteristics (more faults, a new storage hierarchy including non-volatile memory and burst buffers, limited file system bandwidth) will force most applications to adapt their fault tolerance strategy. The session on Fault Tolerance for HPC presents talks offering key insights into four major aspects of adapting applications for future HPC systems: failure characterization, application resiliency and failure injection, new checkpoint/restart techniques, and resilient programming with MPI.
 
Presentations: Characterizing Faults, Errors & Failures in Extreme-Scale Computing Systems
01:45 pm - 02:05 pm
  Christian Engelmann, ORNL
 
Evaluating Parallel Application Resiliency with the Software Fault Injector, PFSEFI
02:05 pm - 02:25 pm
  Nathan DeBardeleben, Los Alamos National Laboratory
 
Performance Portable Checkpoint/Restart with VeloC & UnifyCR
02:25 pm - 02:45 pm
  Kathryn Mohror, LLNL
 
Support for Resilience in Parallel Applications
02:45 pm - 03:05 pm
  George Bosilca, University of Tennessee
 
Questions & Answers
03:05 pm - 03:15 pm