On node state reconstruction for fault tolerant distributed algorithms

Michael Okun*, Amnon Barak

*Corresponding author for this work

Research output: Contribution to journal > Conference article > peer-review

Abstract

One of the main methods for achieving fault tolerance in distributed systems is recovery of the state of failed components. Though generic recovery methods such as checkpointing and message logging exist, in many cases the recovery has to be application-specific. In this paper we propose a general model for node state reconstruction after crash failures. In our model the reconstruction operation is defined only by the requirements it fulfills, without reference to the specific, application-dependent way in which it is performed. The model provides a framework for the formal treatment of algorithm-specific and system-specific recovery procedures. It is used to specify node state reconstruction procedures for several widely used distributed algorithms and systems, and to prove their correctness.

Original language: English
Pages (from-to): 160-168
Number of pages: 9
Journal: Proceedings of the IEEE Symposium on Reliable Distributed Systems
State: Published - 2002
Event: The 21st IEEE Symposium on Reliable Distributed Systems (SRDS-2002) - Suita, Japan
Duration: 13 Oct 2002 to 16 Oct 2002

Keywords

  • Distributed algorithms
  • Fault tolerance
  • Recovery
  • State reconstruction
