Fault Tolerant Resource Efficient Matrix Multiplication.

Noam Birnbaum, Oded Schwartz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

General-purpose hard-error resiliency solutions such as checkpoint-restart severely degrade performance. For numerical linear algebra, more efficient solutions incur lower overhead. Current solutions require a significant increase in the number of processors. Further, they are based on distributed algorithms that guarantee good performance only when the matrices are large enough to fill all the local memories. Otherwise, their inter-processor communication costs are asymptotically larger than the lower bounds dictate.

We obtain fault tolerant parallel matrix multiplication algorithms that reduce the resource overhead by minimizing both the number of additional processors and the communication costs. In particular, we reduce the number of additional processors from Θ(h√P) to 1 (or from Θ(h√P) to h, where h is the maximum number of simultaneous faults), and we save a Θ(log P) factor in the latency costs. Further, for local memories larger than the minimum required to store the input and output, we obtain fault tolerant adaptations of the 2.5D algorithm that significantly reduce the communication costs, with very few additional processors.
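
To give a rough sense of the scale of the processor savings stated above, the following minimal sketch compares the two overhead figures for a few sample machine sizes. It is an illustration only: the function names are hypothetical, the sample values of P and h are arbitrary, and the constant hidden by the Θ notation is assumed to be 1, which the abstract does not state.

```python
import math


def extra_processors_prior(P: int, h: int) -> int:
    """Illustrative count for a Theta(h * sqrt(P)) overhead.

    Assumption: the constant hidden by Theta is taken to be 1,
    and sqrt(P) is rounded down to an integer.
    """
    return h * math.isqrt(P)


def extra_processors_reduced(h: int) -> int:
    """Additional processors when the overhead is h (one per tolerated fault)."""
    return h


if __name__ == "__main__":
    # Arbitrary sample values: P processors, h simultaneous faults.
    for P, h in [(64, 1), (1024, 2), (4096, 4)]:
        prior = extra_processors_prior(P, h)
        reduced = extra_processors_reduced(h)
        print(f"P={P}, h={h}: Theta(h*sqrt(P)) ~ {prior}, reduced = {reduced}")
```

Under these assumptions, the gap grows with √P: the first figure scales with the machine size, while the second depends only on the number of faults to be tolerated.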
Original language: English
Title of host publication: CSC 2018
Publisher: Society for Industrial and Applied Mathematics
Pages: 23-34
Number of pages: 12
ISBN (Electronic): 978-1-61197-521-5
DOIs
State: Published - 2018
Event: SIAM Workshop on Combinatorial Scientific Computing, CSC18 - Bergen, Norway
Duration: 6 Jun 2018 to 8 Jun 2018
https://epubs.siam.org/doi/10.1137/1.9781611975215

Conference

Conference: SIAM Workshop on Combinatorial Scientific Computing, CSC18
Abbreviated title: CSC18
Country/Territory: Norway
City: Bergen
Period: 6/06/18 to 8/06/18
Internet address
