Proactive Runtime Detection of Aging-Related Silent Data Corruptions: A Bottom-Up Approach

Jiacheng Ma, Majd Ganaiem, Madeline Burbage, Theo Gregersen, Rachel McAmis, Freddy Gabbay, Baris Kasikci

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Recent advancements in semiconductor process technologies have unveiled the susceptibility of hardware circuits to reliability issues, especially those related to transistor aging. Transistor aging gradually degrades gate performance, eventually causing hardware to behave incorrectly. Such misbehaving hardware can result in silent data corruptions (SDCs) in software - -a type of failure that comes without logs or exceptions, but causes miscomputing instructions, bitflips, and broken cache coherency. Alas, while design efforts can be made to mitigate transistor aging, complete elimination of this problem during design and fabrication cannot be guaranteed. This emerging challenge calls for a mechanism that not only detects potentially aged hardware in the field, but also triggers software mitigations at application runtime.We propose Vega, a novel workflow that allows efficient detection of aging-related failures at software runtime. Vega leverages the well-studied gate-level modeling of aging effects to identify susceptible signal propagation paths that could fail due to transistor aging. It then utilizes formal verification techniques to generate short test cases that activate these paths and detect any failure within them. Vega integrates the test cases into a user application by directly fusing them together, or by packaging the test cases into a library that the application can invoke. We demonstrate our proposed techniques on the arithmetic logic unit and floating-point unit of a RISC-V CPU. We show that Vega generates effective test cases and integrates them into applications with an average of 0.8% performance overhead.

Original languageEnglish
Title of host publicationASPLOS 2024 - Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
PublisherAssociation for Computing Machinery
Pages220-235
Number of pages16
ISBN (Electronic)9798400703911
DOIs
StatePublished - 10 Apr 2025
Event29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024 - San Diego, United States
Duration: 27 Apr 20241 May 2024

Publication series

NameInternational Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
Volume4

Conference

Conference29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
Country/TerritoryUnited States
CitySan Diego
Period27/04/241/05/24

Bibliographical note

Publisher Copyright:
© 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Fingerprint

Dive into the research topics of 'Proactive Runtime Detection of Aging-Related Silent Data Corruptions: A Bottom-Up Approach'. Together they form a unique fingerprint.

Cite this