Multiplying 2 × 2 Sub-Blocks Using 4 Multiplications

Yoav Moran, Oded Schwartz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Fast parallel and sequential matrix multiplication algorithms switch to the cubic-time classical algorithm on small sub-blocks, because the classical algorithm requires fewer operations on small blocks. We obtain a new algorithm that can outperform the classical one, even on small blocks, by trading multiplications for additions. This algorithm contradicts the common belief that the classical algorithm is the fastest algorithm for small blocks. To this end, we introduce commutative algorithms that generalize Winograd's folding technique (1968) and combine it with fast matrix multiplication algorithms. Thus, when a single scalar multiplication requires ρ times more clock cycles than an addition (e.g., for 16-bit integers on Intel's Skylake microarchitecture, ρ is between 1.5 and 5), our technique reduces the computation cost of multiplying the small sub-blocks to a fraction (ρ + 3)/(2(ρ + 1)) of the cost of the classical algorithm, at the price of a low-order-term overhead in communication cost in both the sequential and the parallel cases, thus reducing the total runtime of the algorithm. Our technique also reduces the energy cost of the algorithm, since the ρ values for energy costs are typically larger than the ρ values for arithmetic costs. For example, we obtain an algorithm for multiplying 2 × 2 blocks using only four multiplications. This algorithm seemingly contradicts the lower bound of Winograd (1971) on multiplying 2 × 2 matrices. However, we obtain this algorithm by bypassing the implicit assumptions of the lower bound. We provide a new lower bound matching our algorithm for 2 × 2 block multiplication, thus showing that our technique is optimal.
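To make the folding idea concrete, the following is a minimal sketch of Winograd's 1968 inner-product trick applied to matrix multiplication over a commutative ring. It is not the algorithm of this paper, which generalizes the technique and combines it with fast matrix multiplication; the function names (`classical_matmul`, `winograd_folded_matmul`, `cost_ratio`) are hypothetical and chosen for illustration. The sketch assumes an even inner dimension and relies on the identity (a1 + b2)(a2 + b1) − a1·a2 − b1·b2 = a1·b1 + a2·b2, which holds only when the entries commute. The `cost_ratio` helper simply evaluates the factor quoted in the abstract; one operation count consistent with that factor, assumed here rather than taken from the abstract, is 4 multiplications and 12 additions per 2 × 2 block against 8 multiplications and 8 additions for the classical algorithm.

```python
from fractions import Fraction


def classical_matmul(A, B):
    """Reference cubic-time product of A (n x m) and B (m x p)."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def winograd_folded_matmul(A, B):
    """Product of A (n x m) and B (m x p), m even, via Winograd's 1968 folding.

    Entries must commute under multiplication (e.g. integers or floats).
    """
    n, m, p = len(A), len(B), len(B[0])
    if m % 2 != 0:
        raise ValueError("this sketch assumes an even inner dimension")
    half = m // 2

    # Correction terms depend on a single row of A or a single column of B,
    # so they are computed once and reused for every output entry.
    row = [sum(A[i][2 * k] * A[i][2 * k + 1] for k in range(half))
           for i in range(n)]
    col = [sum(B[2 * k][j] * B[2 * k + 1][j] for k in range(half))
           for j in range(p)]

    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            # One multiplication per pair of inner indices instead of two.
            folded = sum((A[i][2 * k] + B[2 * k + 1][j]) *
                         (A[i][2 * k + 1] + B[2 * k][j])
                         for k in range(half))
            C[i][j] = folded - row[i] - col[j]
    return C


def cost_ratio(rho):
    """Factor (rho + 3) / (2 (rho + 1)) quoted in the abstract."""
    rho = Fraction(rho)
    return (rho + 3) / (2 * (rho + 1))
```

For a lone 2 × 2 product the correction terms add four more multiplications, so the saving in this sketch appears only when the row and column terms are shared across many blocks of a larger matrix. At the ends of the ρ range quoted above, `cost_ratio(Fraction(3, 2))` evaluates to 9/10 and `cost_ratio(5)` to 2/3.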

Original language: American English
Title of host publication: SPAA 2023 - Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures
Publisher: Association for Computing Machinery
Number of pages: 12
ISBN (Electronic): 9781450395458
State: Published - 17 Jun 2023
Event: 35th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2023 - Orlando, United States
Duration: 17 Jun 2023 → 19 Jun 2023

Publication series

Name: Annual ACM Symposium on Parallelism in Algorithms and Architectures


Conference: 35th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2023
Country/Territory: United States

Bibliographical note

Publisher Copyright:
© 2023 Owner/Author.


  • commutative matrix multiplication
  • matrix multiplication

