Atamert Rahma: Towards Soft Error Resilience in SWE with TeaMPI

SCCS Colloquium


The growing demand for HPC applications has led to large clusters with thousands of nodes, which suffer from various types of failures that are expected to become even more frequent in future systems. Resilience must therefore be taken into account when running such large applications. In this thesis, we focus on soft errors, which can corrupt the state of an SWE application without raising an error. We introduce and analyze three methods that provide soft error resilience. The first method provides only soft error detection, using hash-value comparison of the results; the second and third methods additionally provide recovery within the application. Their detection and recovery mechanisms rely on process replication and admissibility validation of the results. We analyze the soft error outcomes of our methods in various situations using bit-flip injections and discuss their advantages and drawbacks. We find that the second method, which employs task sharing, is the most efficient in terms of performance. The third method, however, which relies on independent redundant computation, can additionally correct some errors that the other methods can only detect.
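To make the detection idea concrete, here is a minimal sketch of hash-value comparison between two replicated results, with a single injected bit flip. It is not the thesis implementation and does not use TeaMPI or SWE. The helper names (`flip_bit`, `state_hash`, `is_admissible`), the choice of SHA-256, and the admissibility criterion (finite, non-negative water heights) are all assumptions made for illustration.

```python
import hashlib
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0..63) of the IEEE 754 double representation of value."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def state_hash(values: list[float]) -> str:
    """Hash the raw bytes of a result array (hypothetical helper)."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def is_admissible(heights: list[float]) -> bool:
    """Assumed criterion: water heights must be finite and non-negative."""
    return all(math.isfinite(h) and h >= 0.0 for h in heights)


# Two replicas compute the same result; a soft error corrupts one of them.
replica_a = [1.0, 2.0, 0.5]
replica_b = list(replica_a)
replica_b[1] = flip_bit(replica_b[1], 0)  # flip the lowest mantissa bit

# Detection: the hashes differ, but hashing alone cannot tell which replica is wrong.
mismatch = state_hash(replica_a) != state_hash(replica_b)

# A flip of the sign bit yields a negative height, which the admissibility
# check can attribute to a specific replica.
replica_c = list(replica_a)
replica_c[1] = flip_bit(replica_c[1], 63)
```

In this sketch the low-order flip in `replica_b` leaves the values physically plausible, so only the hash comparison notices it. The sign flip in `replica_c` fails the admissibility check, which identifies the faulty replica and shows why a validity check can help with recovery.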

Bachelor's thesis talk. Atamert is advised by Philipp Samfass.