TY - GEN
T1 - Learning mixtures of arbitrary distributions over large discrete domains
AU - Rabani, Yuval
AU - Schulman, Leonard J.
AU - Swamy, Chaitanya
PY - 2014
Y1 - 2014
N2 - We give an algorithm for learning a mixture of unstructured distributions. This problem arises in various unsupervised learning scenarios, for example in learning topic models from a corpus of documents spanning several topics. We show how to learn the constituents of a mixture of k arbitrary distributions over a large discrete domain [n] = {1, 2, ⋯, n} and the mixture weights, using O(n polylog n) samples. (In the topic-model learning setting, the mixture constituents correspond to the topic distributions.) This task is information-theoretically impossible for k > 1 under the usual sampling process from a mixture distribution. However, there are situations (such as the above-mentioned topic model case) in which each sample point consists of several observations from the same mixture constituent. This number of observations, which we call the "sampling aperture", is a crucial parameter of the problem. We obtain the first bounds for this mixture-learning problem without imposing any assumptions on the mixture constituents. We show that efficient learning is possible exactly at the information-theoretically least possible aperture of 2k - 1. Thus, we achieve near-optimal dependence on n and optimal aperture. While the sample size required by our algorithm depends exponentially on k, we prove that such a dependence is unavoidable when one considers general mixtures. A sequence of tools contributes to the algorithm, including concentration results for random matrices, dimension reduction, moment estimation, and sensitivity analysis.
KW - Convex geometry
KW - Linear programming
KW - Mixture learning
KW - Moment methods
KW - Randomized algorithms
KW - Spectral techniques
KW - Topic models
UR - http://www.scopus.com/inward/record.url?scp=84893305245&partnerID=8YFLogxK
U2 - 10.1145/2554797.2554818
DO - 10.1145/2554797.2554818
M3 - Conference contribution
AN - SCOPUS:84893305245
SN - 9781450322430
T3 - ITCS 2014 - Proceedings of the 2014 Conference on Innovations in Theoretical Computer Science
SP - 207
EP - 223
BT - ITCS 2014 - Proceedings of the 2014 Conference on Innovations in Theoretical Computer Science
PB - Association for Computing Machinery
T2 - 2014 5th Conference on Innovations in Theoretical Computer Science, ITCS 2014
Y2 - 12 January 2014 through 14 January 2014
ER -