## Abstract

Fast parallel and sequential matrix multiplication algorithms switch to the cubic-time classical algorithm on small sub-blocks, because the classical algorithm requires fewer operations on small blocks. We obtain a new algorithm that can outperform the classical one, even on small blocks, by trading multiplications for additions. This algorithm contradicts the common belief that the classical algorithm is the fastest algorithm for small blocks. To this end, we introduce commutative algorithms that generalize Winograd's folding technique (1968) and combine it with fast matrix multiplication algorithms. Thus, when a single scalar multiplication requires ρ times more clock cycles than an addition (e.g., for 16-bit integers on Intel's Skylake microarchitecture, ρ is between 1.5 and 5), our technique reduces the computation cost of multiplying the small sub-blocks by a factor of (ρ + 3) / (2(ρ + 1)) compared to using the classical algorithm. This comes at the price of a low-order-term communication cost overhead in both the sequential and the parallel cases, and thus reduces the total runtime of the algorithm. Our technique also reduces the energy cost of the algorithm; the ρ values for energy costs are typically larger than the ρ values for arithmetic costs. For example, we obtain an algorithm for multiplying 2 × 2 blocks using only four multiplications. This algorithm seemingly contradicts the lower bound of Winograd (1971) on multiplying 2 × 2 matrices. However, we obtain this algorithm by bypassing the implicit assumptions of the lower bound. We provide a new lower bound matching our algorithm for 2 × 2 block multiplication, thus showing our technique is optimal.
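To make the idea of trading multiplications for additions concrete, the following is a minimal sketch of Winograd's 1968 inner-product identity applied to matrix multiplication over commutative scalars. It illustrates only the classical folding technique that the abstract says is generalized; it is not the algorithm of the paper. The function names `winograd_matmul` and `cost_factor` are hypothetical, and `cost_factor` simply evaluates the ratio quoted in the abstract.

```python
def winograd_matmul(A, B):
    """Multiply A (m x n) by B (n x p) using Winograd's 1968 identity.

    Assumes n is even and that scalar multiplication is commutative.
    Illustrative sketch only; not the algorithm from the paper.
    """
    m, n, p = len(A), len(B), len(B[0])
    if n % 2 != 0:
        raise ValueError("inner dimension must be even for this sketch")
    half = n // 2

    # Correction terms: computed once per row of A and once per column of B.
    row_corr = [sum(A[i][2 * k] * A[i][2 * k + 1] for k in range(half))
                for i in range(m)]
    col_corr = [sum(B[2 * k][j] * B[2 * k + 1][j] for k in range(half))
                for j in range(p)]

    C = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            # One multiplication per pair of terms in the inner product:
            # (a1 + b2)(a2 + b1) = a1*a2 + a1*b1 + b2*a2 + b2*b1
            folded = sum(
                (A[i][2 * k] + B[2 * k + 1][j]) * (A[i][2 * k + 1] + B[2 * k][j])
                for k in range(half)
            )
            C[i][j] = folded - row_corr[i] - col_corr[j]
    return C


def cost_factor(rho):
    """Evaluate the ratio (rho + 3) / (2 (rho + 1)) quoted in the abstract."""
    return (rho + 3) / (2 * (rho + 1))


if __name__ == "__main__":
    print(winograd_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(cost_factor(1.5), cost_factor(5))
```

Each output entry uses about half as many multiplications as the classical inner product, at the cost of extra additions. Evaluating the quoted ratio at the two ends of the Skylake range gives 0.9 for ρ = 1.5 and about 0.67 for ρ = 5, so the saving grows as multiplication becomes relatively more expensive.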

Original language | American English |
---|---|
Title of host publication | SPAA 2023 - Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures |
Publisher | Association for Computing Machinery |
Pages | 379-390 |
Number of pages | 12 |
ISBN (Electronic) | 9781450395458 |
DOIs | |
State | Published - 17 Jun 2023 |
Event | 35th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2023 - Orlando, United States |
Duration | 17 Jun 2023 → 19 Jun 2023 |

### Publication series

Name | Annual ACM Symposium on Parallelism in Algorithms and Architectures |
---|---|

### Conference

Conference | 35th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2023 |
---|---|
Country/Territory | United States |
City | Orlando |
Period | 17/06/23 → 19/06/23 |

### Bibliographical note

Publisher Copyright: © 2023 Owner/Author.

## Keywords

- commutative matrix multiplication
- matrix multiplication