Approximating entropy from sublinear samples

Mickey Brautbar, Alex Samorodnitsky

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We consider the problem of approximating the entropy of a discrete distribution P on a domain of size q, given access to n independent samples from the distribution. It is known that n ≥ q is necessary, in general, for a good additive estimate of the entropy. The problem of multiplicative entropy estimation was recently addressed by Batu, Dasgupta, Kumar, and Rubinfeld. They show that n = q^α suffices for a factor-α approximation, α < 1. We introduce a new parameter of a distribution - its effective alphabet size q_ef(P). This is a more intrinsic property of the distribution, depending only on its entropy moments. We show q_ef ≤ Õ(q). When the distribution P is essentially concentrated on a small part of the domain, q_ef ≪ q. We strengthen the result of Batu et al. by showing that it holds with q_ef replacing q. This has several implications. In particular, the rate of convergence of the maximum-likelihood entropy estimator (the empirical entropy), for both finite and infinite alphabets, is shown to be dictated by the effective alphabet size of the distribution. Several new, and some known, facts about this estimator follow easily. Our main result is algorithmic. Though the effective alphabet size is, in general, an unknown parameter of the distribution, we give an efficient procedure, with access to the alphabet size only, that achieves a factor-α approximation of the entropy with n = Õ(exp{α^{1/4} · log^{3/4} q · log^{1/4} q_ef}). Assuming (for instance) log q_ef ≪ log q, this is smaller than any power of q. Taking α → 1 leads, in this case, to efficient additive estimates for the entropy as well. In particular, this result shows that for many natural scenarios, a tight estimate of the entropy may be achieved using a sublinear sample. Several extensions of the results above are discussed.
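
Note: the abstract refers to the maximum-likelihood (empirical, or plug-in) entropy estimator. The following is a minimal Python sketch of that estimator, included purely for illustration; the function name empirical_entropy and the example distribution are assumptions of this sketch, and this is not the paper's sublinear-sample procedure.

import math
import random
from collections import Counter

def empirical_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate, in bits.

    Computes H(P_hat) = -sum_x p_hat(x) * log2(p_hat(x)), where p_hat is the
    empirical distribution of the observed samples. This is the estimator
    whose convergence rate the abstract relates to the effective alphabet size.
    """
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

if __name__ == "__main__":
    # Illustrative only: samples from a distribution concentrated on a few
    # symbols, so the empirical entropy stabilizes after relatively few draws.
    random.seed(0)
    samples = random.choices(population=[0, 1, 2, 3],
                             weights=[0.7, 0.2, 0.05, 0.05],
                             k=10_000)
    print(f"empirical entropy: {empirical_entropy(samples):.3f} bits")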

Original language: English
Title of host publication: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007
Publisher: Association for Computing Machinery
Pages: 366-375
Number of pages: 10
ISBN (Electronic): 9780898716245
State: Published - 2007
Event: 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007 - New Orleans, United States
Duration: 7 Jan 2007 - 9 Jan 2007

Publication series

Name: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms
Volume: 07-09-January-2007

Conference

Conference: 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007
Country/Territory: United States
City: New Orleans
Period: 7/01/07 - 9/01/07

Bibliographical note

Publisher Copyright:
Copyright © 2007 by the Association for Computing Machinery, Inc. and the Society for Industrial and Applied Mathematics.
