Stragglers in Distributed Matrix Multiplication

Roy Nissim*, Oded Schwartz

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

A delay in a single processor may affect an entire system, since the slowest processor typically determines the runtime. Problems with such stragglers are often mitigated using dynamic load balancing or redundancy solutions such as task replication. Unfortunately, the former incurs a high communication cost, and the latter significantly increases the arithmetic cost and memory footprint, making high resource overhead seem inevitable. Matrix multiplication and other numerical linear algebra kernels typically have structures that allow better straggler management. Redundancy-based solutions tailored for such algorithms often embed codes in the algorithm’s structure. These solutions add a fixed-cost overhead and may perform worse than the original algorithm when few or no delays occur. We propose a new load-balancing solution tailored for distributed matrix multiplication. Our solution reduces latency overhead by O(P / log P) compared to existing dynamic load-balancing solutions, where P is the number of processors. It outperforms redundancy-based solutions in all parameters: arithmetic cost, bandwidth cost, latency cost, memory footprint, and the number of stragglers it can tolerate. Moreover, our overhead costs depend on the severity of delays and are negligible when delays are minor. We compare our solution with previous ones and demonstrate significant improvements in both asymptotic analysis and simulations: up to 4.4× and 5.3× compared to general-purpose dynamic load balancing and redundancy-based solutions, respectively.
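The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the authors' algorithm. It assumes a synchronous step in which the work is split evenly, each processor has a fixed relative speed, and replication runs every task on a group of `r` processors and keeps the fastest copy. The function names and the numbers are hypothetical, chosen only for illustration.

```python
def makespan(work, speeds):
    """Runtime of one synchronous step: the slowest processor decides."""
    share = work / len(speeds)           # equal share of work per processor
    return max(share / s for s in speeds)


def makespan_replicated(work, speeds, r=2):
    """Toy task replication (an assumption, not the paper's scheme).

    Processors are grouped r at a time; every group runs the same task and
    the task finishes as soon as its fastest replica finishes.
    """
    groups = [speeds[i:i + r] for i in range(0, len(speeds), r)]
    share = work / len(groups)           # fewer distinct tasks, each r times larger
    return max(share / max(g) for g in groups)


work = 8.0
healthy = [1.0] * 8                      # no delays
one_straggler = [1.0] * 7 + [0.25]       # one processor runs at quarter speed

results = {
    "plain_healthy": makespan(work, healthy),
    "plain_straggler": makespan(work, one_straggler),
    "replicated_healthy": makespan_replicated(work, healthy),
    "replicated_straggler": makespan_replicated(work, one_straggler),
}
print(results)
```

In this toy setting, replication hides the single straggler but pays a fixed cost even when no processor is delayed, which is the behavior the abstract attributes to redundancy-based solutions.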

Original language: English
Title of host publication: Job Scheduling Strategies for Parallel Processing - 26th Workshop, JSSPP 2023, Revised Selected Papers
Editors: Dalibor Klusáček, Julita Corbalán, Gonzalo P. Rodrigo
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 74-96
Number of pages: 23
ISBN (Print): 9783031439421
DOIs
State: Published - 2023
Event: 26th workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2023 - St. Petersburg, United States
Duration: 19 May 2023 → 19 May 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14283 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 26th workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2023
Country/Territory: United States
City: St. Petersburg
Period: 19/05/23 → 19/05/23

Bibliographical note

Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

Keywords

  • Distributed Computing
  • Dynamic Load Balancing
  • Numerical Linear Algebra
  • Straggler Mitigation
