Abstract
Exascale machines have a small mean time between failures, necessitating fault tolerance. Out-of-the-box fault-tolerant solutions, such as checkpoint-restart and replication, apply to any algorithm but incur significant overhead costs. Long integer multiplication is a fundamental kernel in numerical linear algebra and cryptography. The naïve schoolbook multiplication algorithm runs in $\Theta(n^2)$, while Toom-Cook algorithms run in $\Theta(n^{\log_\kappa(2\kappa-1)})$ for $\kappa \ge 2$. We obtain the first efficient fault-tolerant parallel Toom-Cook algorithm. While asymptotically faster FFT-based algorithms exist, Toom-Cook algorithms are often favored in practice, both at small scale and on supercomputers. Our algorithm enables fault tolerance with negligible overhead costs. Compared to existing general-purpose fault-tolerant solutions, our algorithm reduces the arithmetic and communication (bandwidth) overhead costs by a factor of $\Theta(P/(2\kappa-1))$, where $P$ is the number of processors. To this end, we adapt the fault-tolerant BFS-DFS method of Birnbaum et al. (2020) for fast matrix multiplication and combine it with a coding strategy tailored for Toom-Cook. This eliminates the need for recomputations, resulting in a much faster algorithm.
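To make the running-time claim concrete, the following is a minimal sequential sketch of the smallest Toom-Cook instance, $\kappa = 2$ (Karatsuba), in Python. It is not the parallel, fault-tolerant algorithm of the paper; the function name `toom2_multiply` and the `cutoff_bits` parameter are hypothetical choices made for illustration. Each call splits both operands into $\kappa = 2$ parts and makes $2\kappa - 1 = 3$ recursive multiplications, which is where the exponent $\log_2 3 \approx 1.585$ comes from.

```python
def toom2_multiply(a: int, b: int, cutoff_bits: int = 64) -> int:
    """Multiply two non-negative integers with Toom-2 (Karatsuba).

    Illustrative sketch only: sequential, with no fault tolerance.
    `cutoff_bits` (an assumed tuning parameter) is the operand size
    below which the built-in multiplication is used instead.
    """
    if a < 0 or b < 0:
        raise ValueError("operands must be non-negative")

    # Base case: small operands are multiplied directly.
    if a.bit_length() <= cutoff_bits or b.bit_length() <= cutoff_bits:
        return a * b

    # Split each operand into kappa = 2 parts of `half` bits.
    half = max(a.bit_length(), b.bit_length()) // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask

    # 2*kappa - 1 = 3 recursive products instead of 4.
    low = toom2_multiply(a_lo, b_lo, cutoff_bits)
    high = toom2_multiply(a_hi, b_hi, cutoff_bits)
    mixed = toom2_multiply(a_lo + a_hi, b_lo + b_hi, cutoff_bits)

    # Recover the middle coefficient a_hi*b_lo + a_lo*b_hi.
    middle = mixed - low - high

    # Recombine: high * 2^(2*half) + middle * 2^half + low.
    return (high << (2 * half)) + (middle << half) + low
```

A quick check is to compare the result against Python's built-in product, for example `toom2_multiply(3**500, 7**400) == 3**500 * 7**400`. Larger values of $\kappa$ follow the same pattern with $2\kappa - 1$ sub-products per level, at the price of a more involved interpolation step.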
Original language | English |
---|---|
Title of host publication | SPAA 2024 - Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures |
Publisher | Association for Computing Machinery |
Pages | 207-218 |
Number of pages | 12 |
ISBN (Electronic) | 9798400704161 |
State | Published - 17 Jun 2024 |
Event | 36th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2024 - Nantes, France; Duration: 17 Jun 2024 → 21 Jun 2024 |
Publication series
Name | Annual ACM Symposium on Parallelism in Algorithms and Architectures |
---|---|
ISSN (Print) | 1548-6109 |
Conference
Conference | 36th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2024 |
---|---|
Country/Territory | France |
City | Nantes |
Period | 17/06/24 → 21/06/24 |
Bibliographical note
Publisher Copyright: © 2024 Owner/Author.
Keywords
- fault tolerance
- i/o complexity
- long integer multiplication
- parallel computing
- toom-cook