Tree-based fault-tolerant collective operations for MPI

Alexander Margolin*, Amnon Barak

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

As high-performance computing systems grow in size and complexity, both the probability of failures and the cost of recovery increase. Parallel applications running on these systems should be able to continue running despite node failures at arbitrary times. Collective operations are essential for many parallel MPI applications and are often the first to detect such failures. This work presents tree-based fault-tolerant collective operations, which combine fault detection and recovery as an integral part of each operation. We do this by extending existing tree-based algorithms to allow a collective operation to succeed despite nodes failing before or during its run. This differs from other approaches, where recovery takes place only after such operations have failed. The article includes a comparison between the performance of the proposed algorithm and other approaches, as well as a simulator-based analysis of performance at scale.
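The general idea of a tree-based reduction that tolerates failed nodes can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the algorithm from the article: the binary-tree layout, the rule that a failed rank simply forwards its children's partial results, and the names `children`, `tree_reduce`, and `failed` are all hypothetical, and no real MPI calls or failure detection are involved.

```python
import operator


def children(rank, size):
    """Children of `rank` in a binary tree laid out over ranks 0..size-1."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]


def tree_reduce(rank, values, failed, op):
    """Reduce the subtree rooted at `rank`, skipping failed ranks.

    Assumption for illustration: a failed rank contributes no value of its
    own, but the partial results of its live descendants are still passed
    upward, as if the parent had adopted the failed rank's children.
    Returns None when the whole subtree has failed.
    """
    size = len(values)
    result = None if rank in failed else values[rank]
    for child in children(rank, size):
        partial = tree_reduce(child, values, failed, op)
        if partial is None:
            continue  # nothing alive below this child
        result = partial if result is None else op(result, partial)
    return result


if __name__ == "__main__":
    values = list(range(8))  # rank i contributes the value i
    print(tree_reduce(0, values, set(), operator.add))
    print(tree_reduce(0, values, {1}, operator.add))
```

In this toy model the reduction completes with the contributions of all surviving ranks, whereas a conventional tree reduction would lose the entire subtree below a failed rank or abort the operation.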

Original language: English
Article number: e5826
Journal: Concurrency and Computation: Practice and Experience
Volume: 33
Issue number: 14
State: Published - 25 Jul 2021

Bibliographical note

Publisher Copyright:
© 2020 John Wiley & Sons, Ltd.

Keywords

  • Allreduce
  • MPI
  • collective operations
  • fault-tolerance
