
Compute-based Fault Tolerance for DNN

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Deep neural network (DNN) systems use many GPUs, any of which can fail, so fault tolerance (FT) is essential to avoid cluster restarts. Traditional FT relies on frequent checkpointing, which incurs high bandwidth and memory costs. We propose an alternative strategy based on GPU redundancy and introduce uniform and heterogeneous encoding approaches.

Original language: English
Title of host publication: Proceedings of the 18th ACM International Systems and Storage Conference, SYSTOR 2025
Publisher: Association for Computing Machinery, Inc
Pages: 216
Number of pages: 1
ISBN (Electronic): 9798400721199
State: Published - 8 Sep 2025
Event: 18th ACM International Systems and Storage Conference, SYSTOR 2025 - Virtual, Online, Israel
Duration: 8 Sep 2025 → 9 Sep 2025

Publication series

Name: Proceedings of the 18th ACM International Systems and Storage Conference, SYSTOR 2025

Conference

Conference: 18th ACM International Systems and Storage Conference, SYSTOR 2025
Country/Territory: Israel
City: Virtual, Online
Period: 8/09/25 → 9/09/25

Bibliographical note

Publisher Copyright:
© 2025 Copyright held by the owner/author(s).

Keywords

  • DNN
  • Distributed Systems
  • Fault Tolerance
  • GPU
