Corrected trees for reliable group communication

Martin Küttler, Maksym Planeta, Jan Bierbaum, Carsten Weinhold, Hermann Härtig, Amnon Barak, Torsten Hoefler

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

2 Scopus citations

Abstract

Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth places a significant burden on fast group communication primitives and also makes these systems more susceptible to failures of individual nodes. In this paper we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach, from a formal analysis through LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution achieves a latency reduction of 50% while sending up to six times fewer messages than existing schemes.

Original language: English
Title of host publication: PPoPP 2019 - Proceedings of the 24th Principles and Practice of Parallel Programming
Publisher: Association for Computing Machinery
Pages: 287-299
Number of pages: 13
ISBN (Electronic): 9781450362252
State: Published - 16 Feb 2019
Event: 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019 - Washington, United States
Duration: 16 Feb 2019 - 20 Feb 2019

Publication series

Name: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

Conference

Conference: 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019
Country/Territory: United States
City: Washington
Period: 16/02/19 - 20/02/19

Bibliographical note

Publisher Copyright:
© 2019 Copyright held by the owner/author(s).

Keywords

  • Gossip
  • HPC
  • LogP model
  • Low-latency broadcast
  • MPI
