Abstract
The generic problem of estimation and inference given a sequence of i.i.d. samples has been extensively studied in the statistics, property testing, and learning communities. A natural quantity of interest is the sample complexity of the particular learning or estimation problem being considered. While sample complexity is an important component of the computational efficiency of the task, it is also natural to consider the space complexity: do we need to store all the samples as they are drawn, or is it sufficient to use memory that is significantly sublinear in the sample complexity? Surprisingly, this aspect of the complexity of estimation has received significantly less attention in all but a few specific cases. While space-bounded, sequential computation is the purview of the field of data-stream computation, almost all of the literature on the algorithmic theory of data-streams considers only “empirical problems”, where the goal is to compute a function of the data present in the stream rather than to infer something about the source of the stream.
Our contributions are two-fold. First, we provide results connecting space efficiency to the estimation of robust statistics from a sequence of i.i.d. samples. Robust statistics are a particularly interesting class of statistics in our setting because, by definition, they are resilient to noise or errors in the sampled data. We show that this property is enough to ensure that very space-efficient stream algorithms exist for their estimation. In contrast, the numerical value of a “non-robust” statistic can change dramatically with additional samples, and this limits the utility of any finite length sequence of samples. Second, we present a general result that captures a trade-off between sample and space complexity in the context of distributional property testing.
Original language | English |
---|---|
Title of host publication | Innovations in Computer Science |
Subtitle of host publication | ICS 2010 |
Publisher | Tsinghua University Press |
Pages | 251-265 |
Number of pages | 15 |
ISBN (Print) | 978-7-302-21752-7 |
State | Published - 2010 |
Event | Innovations in Computer Science: ICS 2010 - Tsinghua University, Beijing, China; Duration: 5 Jan 2010 → 7 Jan 2010; https://conference.iiis.tsinghua.edu.cn/ICS2010/ |
Conference
Conference | Innovations in Computer Science |
---|---|
Country/Territory | China |
City | Beijing |
Period | 5/01/10 → 7/01/10 |
Internet address | https://conference.iiis.tsinghua.edu.cn/ICS2010/ |
Keywords
- data streams
- property testing
- robust statistics