Abstract
A delay in a single processor may affect an entire system, since the slowest processor typically determines the runtime. Problems with such stragglers are often mitigated using dynamic load balancing or redundancy solutions such as task replication. Unfortunately, the former option incurs high communication cost, and the latter significantly increases the arithmetic cost and memory footprint, making high resource overhead seem inevitable. Matrix multiplication and other numerical linear algebra kernels typically have structures that allow better straggler management. Redundancy-based solutions tailored for such algorithms often combine codes in the algorithm’s structure. These solutions add fixed-cost overhead and may perform worse than the original algorithm when few or no delays occur. We propose a new load-balancing solution tailored for distributed matrix multiplication. Our solution reduces latency overhead by O(P / log P) compared to existing dynamic load-balancing solutions, where P is the number of processors. Our solution outperforms redundancy-based solutions in all parameters: arithmetic cost, bandwidth cost, latency cost, memory footprint, and the number of stragglers it can tolerate. Moreover, our overhead costs depend on the severity of delays and are negligible when delays are minor. We compare our solution with previous ones and demonstrate significant improvements in asymptotic analysis and simulations: up to ×4.4 and ×5.3 compared to general-purpose dynamic load balancing and redundancy-based solutions, respectively.
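The abstract's opening claim, that the slowest processor determines the runtime, can be illustrated with a small toy model. The sketch below is not the solution proposed in the paper: it compares an equal static split against a generic greedy work queue, ignores all communication cost, and uses hypothetical names (`static_makespan`, `dynamic_makespan`, `speeds`) chosen only for this illustration.

```python
import heapq


def static_makespan(num_tasks, speeds):
    """Equal static split: every processor gets num_tasks / P unit tasks.

    The runtime is the finish time of the slowest processor.
    """
    per_proc = num_tasks / len(speeds)
    return max(per_proc / s for s in speeds)


def dynamic_makespan(num_tasks, speeds):
    """Greedy work queue: the next unit task goes to whichever processor
    becomes free first. Communication cost is ignored (an assumption of
    this toy model, not of the paper).
    """
    # Heap of (time at which the processor becomes free, processor index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    makespan = 0.0
    for _ in range(num_tasks):
        t, i = heapq.heappop(free_at)
        finish = t + 1.0 / speeds[i]  # a unit task takes 1 / speed
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, i))
    return makespan


if __name__ == "__main__":
    speeds = [1.0, 1.0, 1.0, 0.25]  # the last processor is a straggler
    print("static :", static_makespan(16, speeds))
    print("dynamic:", dynamic_makespan(16, speeds))
```

With three full-speed processors and one running at a quarter of the speed, the static split takes 16 time units for 16 unit tasks, while the work queue finishes in 5. In a real distributed setting the reassignment itself costs communication, which is the overhead the abstract refers to.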
Original language | English |
---|---|
Title of host publication | Job Scheduling Strategies for Parallel Processing - 26th Workshop, JSSPP 2023, Revised Selected Papers |
Editors | Dalibor Klusáček, Julita Corbalán, Gonzalo P. Rodrigo |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 74-96 |
Number of pages | 23 |
ISBN (Print) | 9783031439421 |
State | Published - 2023 |
Event | 26th workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2023 - St. Petersburg, United States; Duration: 19 May 2023 → 19 May 2023 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 14283 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 26th workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2023 |
---|---|
Country/Territory | United States |
City | St. Petersburg |
Period | 19/05/23 → 19/05/23 |
Bibliographical note
Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Keywords
- Distributed Computing
- Dynamic Load Balancing
- Numerical Linear Algebra
- Straggler Mitigation