TY - JOUR
T1 - Communication-optimal parallel and sequential Cholesky decomposition
AU - Ballard, Grey
AU - Demmel, James
AU - Holtz, Olga
AU - Schwartz, Oded
PY - 2010
Y1 - 2010
N2 - Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [J. Demmel et al., Communication-optimal Parallel and Sequential QR and LU Factorizations, Technical report EECS-2008-89, University of California, Berkeley, CA, 2008], [J. Demmel et al., Implementing Communication-optimal Parallel and Sequential QR and LU Factorizations, SIAM. J. Sci. Comp., submitted], and [J. Demmel, L. Grigori, and H. Xiang, Communication-avoiding Gaussian Elimination, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008] this gives a set of communication-optimal algorithms for O(n3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.
AB - Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [J. Demmel et al., Communication-optimal Parallel and Sequential QR and LU Factorizations, Technical report EECS-2008-89, University of California, Berkeley, CA, 2008], [J. Demmel et al., Implementing Communication-optimal Parallel and Sequential QR and LU Factorizations, SIAM. J. Sci. Comp., submitted], and [J. Demmel, L. Grigori, and H. Xiang, Communication-avoiding Gaussian Elimination, Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008] this gives a set of communication-optimal algorithms for O(n3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.
KW - Algorithm
KW - Bandwidth
KW - Cholesky decomposition
KW - Communication avoiding
KW - Latency
KW - Lower bound
UR - http://www.scopus.com/inward/record.url?scp=79251563454&partnerID=8YFLogxK
U2 - 10.1137/090760969
DO - 10.1137/090760969
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
AN - SCOPUS:79251563454
SN - 1064-8275
VL - 32
SP - 3495
EP - 3523
JO - SIAM Journal on Scientific Computing
JF - SIAM Journal on Scientific Computing
IS - 6
ER -