Abstract
Deep neural network (DNN) systems use many GPUs, any of which can fail, so fault tolerance (FT) is essential to avoid cluster restarts. Traditional FT relies on frequent checkpointing, which incurs high bandwidth and memory costs. We propose an alternative strategy based on GPU redundancy and introduce uniform and heterogeneous encoding approaches.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 18th ACM International Systems and Storage Conference, SYSTOR 2025 |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 216 |
| Number of pages | 1 |
| ISBN (Electronic) | 9798400721199 |
| DOIs | |
| State | Published - 8 Sep 2025 |
| Event | 18th ACM International Systems and Storage Conference, SYSTOR 2025 - Virtual, Online, Israel. Duration: 8 Sep 2025 → 9 Sep 2025 |
Publication series
| Name | Proceedings of the 18th ACM International Systems and Storage Conference, SYSTOR 2025 |
|---|---|
Conference
| Conference | 18th ACM International Systems and Storage Conference, SYSTOR 2025 |
|---|---|
| Country/Territory | Israel |
| City | Virtual, Online |
| Period | 8/09/25 → 9/09/25 |
Bibliographical note
Publisher Copyright: © 2025 Copyright held by the owner/author(s).
Keywords
- DNN
- Distributed Systems
- Fault Tolerance
- GPU
Fingerprint
Dive into the research topics of 'Compute-based Fault Tolerance for DNN'. Together they form a unique fingerprint.