TY - GEN

T1 - Communication-optimal parallel algorithm for Strassen's matrix multiplication

AU - Ballard, Grey

AU - Demmel, James

AU - Holtz, Olga

AU - Lipshitz, Benjamin

AU - Schwartz, Oded

PY - 2012

Y1 - 2012

AB - Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n = 94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.

KW - Communication-avoiding algorithms

KW - Fast matrix multiplication

KW - Parallel algorithms

UR - http://www.scopus.com/inward/record.url?scp=84864147291&partnerID=8YFLogxK

U2 - 10.1145/2312005.2312044

DO - 10.1145/2312005.2312044

M3 - Conference contribution

AN - SCOPUS:84864147291

SN - 9781450312134

T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures

SP - 193

EP - 204

BT - SPAA'12 - Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures

T2 - 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'12

Y2 - 25 June 2012 through 27 June 2012

ER -