TY - GEN
T1 - Communication-optimal parallel algorithm for Strassen's matrix multiplication
AU - Ballard, Grey
AU - Demmel, James
AU - Holtz, Olga
AU - Lipshitz, Benjamin
AU - Schwartz, Oded
PY - 2012
Y1 - 2012
N2 - Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n = 94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
KW - Communication-avoiding algorithms
KW - Fast matrix multiplication
KW - Parallel algorithms
UR - http://www.scopus.com/inward/record.url?scp=84864147291&partnerID=8YFLogxK
U2 - 10.1145/2312005.2312044
DO - 10.1145/2312005.2312044
M3 - Conference contribution
AN - SCOPUS:84864147291
SN - 9781450312134
T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures
SP - 193
EP - 204
BT - SPAA'12 - Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures
T2 - 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'12
Y2 - 25 June 2012 through 27 June 2012
ER -