Corrected trees for reliable group communication

  • Martin Küttler
  • Maksym Planeta
  • Jan Bierbaum
  • Carsten Weinhold
  • Hermann Härtig
  • Amnon Barak
  • Torsten Hoefler

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth is a significant burden for fast group communication primitives and also makes those systems more susceptible to failures of individual nodes. In this paper we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach - from a formal analysis to LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution achieves a latency reduction of 50% with up to six times fewer messages sent in comparison to existing schemes.

Original language: English
Title of host publication: PPoPP 2019 - Proceedings of the 24th Principles and Practice of Parallel Programming
Publisher: Association for Computing Machinery
Pages: 287-299
Number of pages: 13
ISBN (Electronic): 9781450362252
State: Published - 16 Feb 2019
Event: 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019 - Washington, United States
Duration: 16 Feb 2019 - 20 Feb 2019

Publication series

Name: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
ISSN (Print): 1542-0205

Conference

Conference: 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019
Country/Territory: United States
City: Washington
Period: 16/02/19 - 20/02/19

Bibliographical note

Publisher Copyright:
© 2019 Copyright held by the owner/author(s).

Keywords

  • Gossip
  • HPC
  • LogP model
  • Low-latency broadcast
  • MPI
