Abstract
Deep neural networks (DNNs) excel in applications such as computer vision and natural language processing, including in mission-critical systems. As the computational complexity of these models grows, there is an increasing need for specialized accelerators to handle the demanding workloads. In response, advancements in Very Large Scale Integration (VLSI) process nodes have significantly intensified the development of machine learning (ML) accelerators, offering enhanced transistor miniaturization and power efficiency. However, the susceptibility of these advanced nodes to transistor aging poses risks to ML accelerator performance, prediction accuracy, and reliability, which can impact the functional safety of mission-critical systems. This study focuses on the impact of asymmetric transistor aging, induced by Bias Temperature Instability (BTI), on systolic arrays (SAs), which are integral to many ML accelerators in mission-critical systems. Our aging-aware analysis indicates that SAs experience asymmetric aging, causing logical elements to age at varying rates. In addition, our simulations show that asymmetric transistor aging introduces persistent and transient faults in the SA's datapath, compromising the overall resiliency of the ML model. Even with less than 1% of transient failure events, the top-1 prediction accuracy of the ResNet-18 model drops by 32-50%, and with approximately 0.8% of transient failure events, the accuracy of PTQ4ViT drops by almost 90%. To address this issue, we propose new hardware mechanisms and design flow solutions that mitigate the impact of asymmetric transistor aging on ML accelerator reliability with minimal power and area overhead.
Original language | English |
---|---|
Pages (from-to) | 44041-44061 |
Number of pages | 21 |
Journal | IEEE Access |
Volume | 13 |
State | Published - 2025 |
Bibliographical note
Publisher Copyright: © 2013 IEEE.
Keywords
- Asymmetric aging
- bias temperature instability
- deep neural networks
- machine learning accelerators
- mission critical applications
- systolic arrays
- transistor aging
- very large scale integration