Abstract
Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth places a significant burden on fast group communication primitives and also makes those systems more susceptible to failures of individual nodes. In this paper we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach, ranging from a formal analysis through LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution reduces latency by 50% while sending up to six times fewer messages than existing schemes.
| Original language | English |
|---|---|
| Title of host publication | PPoPP 2019 - Proceedings of the 24th Principles and Practice of Parallel Programming |
| Publisher | Association for Computing Machinery |
| Pages | 287-299 |
| Number of pages | 13 |
| ISBN (Electronic) | 9781450362252 |
| State | Published - 16 Feb 2019 |
| Event | 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, United States. Duration: 16 Feb 2019 → 20 Feb 2019 |
Publication series
| Name | Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP |
|---|---|
| ISSN (Print) | 1542-0205 |
Conference
| Conference | 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019 |
|---|---|
| Country/Territory | United States |
| City | Washington |
| Period | 16/02/19 → 20/02/19 |
Bibliographical note
Publisher Copyright: © 2019 Copyright held by the owner/author(s).
Keywords
- Gossip
- HPC
- LogP model
- Low-latency broadcast
- MPI
Title
Corrected trees for reliable group communication