Previous talks at the SCCS Colloquium

Simon Schuck: Integrating Task Sharing with Team Recovery for Hard Failure Tolerance in teaMPI and SWE

Hard failure tolerance becomes ever more important as high performance computing systems scale up and their mean time between failures shrinks. Checkpoint/restart has been the conventional way of recovering from hard failures, but it becomes increasingly expensive as the trend of upscaling through parallelization continues. This bachelor’s thesis explores an approach that combines replication with task sharing and reactive checkpoint/restart to create a cheap and performant way of dealing with hard failures. With replication, we avoid losing data in case of a failure: we can use a failed process’s replica to create a checkpoint on disk and spawn a new process that loads that checkpoint and replaces the failed one. To counteract the grave performance impact of replication, we employ task outcome sharing between the replicas. This results in a resilient yet performant approach that can even keep up with more conservative proactive checkpoint/restart techniques and provides a promising alternative for future exascale scenarios.
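
The interplay of replication, task outcome sharing, and reactive checkpointing described above can be illustrated with a small single-process simulation. This is only a sketch under simplifying assumptions and not the teaMPI or SWE implementation: the `Replica` class, the `demo` function, the JSON checkpoint format, and the opposite-order task traversal are all hypothetical and chosen purely for illustration.

```python
import json
import os
import tempfile


class Replica:
    """One replica of a logical process (hypothetical, not teaMPI's API)."""

    def __init__(self, name):
        self.name = name
        self.outcomes = {}  # task id -> result, computed locally or received
        self.computed = 0   # number of tasks this replica computed itself

    def run(self, task_id, peer, compute):
        if task_id in self.outcomes:
            return  # outcome was already shared by the peer, skip the work
        result = compute(task_id)
        self.computed += 1
        self.outcomes[task_id] = result
        peer.outcomes[task_id] = result  # task outcome sharing

    def write_checkpoint(self, path):
        with open(path, "w") as f:
            json.dump(self.outcomes, f)

    @classmethod
    def from_checkpoint(cls, name, path):
        replica = cls(name)
        with open(path) as f:
            # JSON stores keys as strings, so convert them back to ints.
            replica.outcomes = {int(k): v for k, v in json.load(f).items()}
        return replica


def demo(num_tasks=8, fail_after=2):
    compute = lambda task_id: task_id * task_id  # stand-in for a real task
    tasks = list(range(num_tasks))
    a, b = Replica("a"), Replica("b")
    # Assumption: replicas traverse the tasks in opposite order, so each
    # one mostly computes outcomes the other has not produced yet.
    for step, (front, back) in enumerate(zip(tasks, reversed(tasks))):
        if step == fail_after:
            # Simulated hard failure of b: its state is lost. The surviving
            # replica writes a checkpoint only now (reactively), and a newly
            # spawned process loads it and takes b's place.
            with tempfile.TemporaryDirectory() as tmp:
                path = os.path.join(tmp, "checkpoint.json")
                a.write_checkpoint(path)
                b = Replica.from_checkpoint("b-respawned", path)
        a.run(front, b, compute)
        b.run(back, a, compute)
    return a, b


if __name__ == "__main__":
    a, b = demo()
    print(a.name, "computed", a.computed, "tasks")
    print(b.name, "computed", b.computed, "tasks")
```

In this toy setting, both replicas end up with the full set of outcomes although each computes only part of the tasks, and no checkpoint is written unless a failure occurs.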

Keywords: HPC, Fault Tolerance

Bachelor's thesis submission talk. Simon is advised by Philipp Samfaß.