## Abstract

General-purpose hard-error resiliency solutions such as checkpoint-restart severely degrade performance. For numerical linear algebra, more efficient solutions with lower overhead exist. However, current solutions require a significant increase in the number of processors. Further, they are based on distributed algorithms that guarantee good performance only when the matrices are large enough to fill all the local memories. Otherwise, their inter-processor communication costs are asymptotically larger than the lower bounds dictate.

We obtain fault-tolerant parallel matrix multiplication algorithms that reduce the resource overhead by minimizing both the number of additional processors and the communication costs. In particular, we reduce the number of additional processors from Θ(h√P) to 1 (or from Θ(h√P) to h, where h is the maximum number of simultaneous faults), and we save a Θ(log P) factor of the latency costs. Further, for local memories larger than the minimum required to store the input and output, we obtain fault-tolerant adaptations of the 2.5D algorithm that significantly reduce the communication costs, with very few additional processors.


Original language | English |
---|---|
Title of host publication | CSC 2018 |
Publisher | Society for Industrial and Applied Mathematics |
Pages | 23-34 |
Number of pages | 12 |
ISBN (Electronic) | 978-1-61197-521-5 |
State | Published - 2018 |
Event | SIAM Workshop on Combinatorial Scientific Computing, CSC18 - Bergen, Norway. Duration: 6 Jun 2018 → 8 Jun 2018. https://epubs.siam.org/doi/10.1137/1.9781611975215 |

### Conference

Conference | SIAM Workshop on Combinatorial Scientific Computing, CSC18 |
---|---|
Abbreviated title | CSC18 |
Country/Territory | Norway |
City | Bergen |
Period | 6/06/18 → 8/06/18 |