TY - GEN
T1 - Communication-optimal parallel and sequential cholesky decomposition
AU - Ballard, Grey
AU - Demmel, James
AU - Holtz, Olga
AU - Schwartz, Oded
PY - 2009
Y1 - 2009
N2 - Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the cost of various Cholesky decomposition implementations to this lower bound, and draw the following conclusions: (1) "Naïve" sequential algorithms for Cholesky attain neither the bandwidth nor latency lower bound (2) The sequential blocked algorithm in LAPACK (with the right block size), as well as various recursive algorithms [AP00, GJ01, AGW01, ST04], and one based on work of Toledo [Tol97], can attain the bandwidth lower bound. (3) The LAPACK algorithm can also attain the latency bound if used with blocked data structures rather than column-wise or row-wise matrix data structures, though the Toledo algorithm cannot. (4) The recursive sequential algorithm due to [AP00] attains the bandwidth and latency lower bounds at every level of a multi-level memory hierarchy, in a "cacheoblivious" way. (5) The parallel implementation of Cholesky in the ScaLAPACK library (again with the right block-size) attains both the bandwidth and latency lower bounds to within a poly-logarithmic factor. Combined with prior results in [DGHL08a, DGHL08b, DGX08] this gives a complete set of communication-optimal algorithms for O(n3) implementations of three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU and QR by optimizing communication for any number of levels of memory hierarchy.
AB - Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the cost of various Cholesky decomposition implementations to this lower bound, and draw the following conclusions: (1) "Naïve" sequential algorithms for Cholesky attain neither the bandwidth nor latency lower bound (2) The sequential blocked algorithm in LAPACK (with the right block size), as well as various recursive algorithms [AP00, GJ01, AGW01, ST04], and one based on work of Toledo [Tol97], can attain the bandwidth lower bound. (3) The LAPACK algorithm can also attain the latency bound if used with blocked data structures rather than column-wise or row-wise matrix data structures, though the Toledo algorithm cannot. (4) The recursive sequential algorithm due to [AP00] attains the bandwidth and latency lower bounds at every level of a multi-level memory hierarchy, in a "cacheoblivious" way. (5) The parallel implementation of Cholesky in the ScaLAPACK library (again with the right block-size) attains both the bandwidth and latency lower bounds to within a poly-logarithmic factor. Combined with prior results in [DGHL08a, DGHL08b, DGX08] this gives a complete set of communication-optimal algorithms for O(n3) implementations of three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU and QR by optimizing communication for any number of levels of memory hierarchy.
KW - Algorithm
KW - Bandwidth
KW - Cholesky decomposition
KW - Communication avoiding
KW - Latency
KW - Lower bound
UR - http://www.scopus.com/inward/record.url?scp=70449623419&partnerID=8YFLogxK
U2 - 10.1145/1583991.1584054
DO - 10.1145/1583991.1584054
M3 - ???researchoutput.researchoutputtypes.contributiontobookanthology.conference???
AN - SCOPUS:70449623419
SN - 9781605586069
T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures
SP - 245
EP - 252
BT - SPAA'09 - Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures
PB - Association for Computing Machinery (ACM)
T2 - 21st Annual Symposium on Parallelism in Algorithms and Architectures, SPAA'09
Y2 - 11 August 2009 through 13 August 2009
ER -