Abstract
Histograms are a simple, well-studied way to summarize data, and they are used extensively in applications that require estimates of data frequency values. Significant previous work has studied the problem of finding optimal histograms with respect to an error measure. In this paper we study the classic problem of finding an optimal histogram for a dataset, with a new twist: the histogram must contain at least n − k of the n data points. The k excluded data points are considered outliers. We consider two notions of excluding data items: allowing arbitrary items to be excluded, or removing items only while retaining a consistent histogram. Polynomial-time algorithms are presented for these problems. Extensive experiments demonstrate that our algorithms work well in practice to reduce the histogram error.
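To make the problem statement concrete, the following is a minimal brute-force sketch and not the paper's polynomial-time algorithms. It rests on several assumptions that the abstract does not state: the data is a sequence of numeric values, a histogram splits the sequence into at most `buckets` contiguous buckets, the error measure is the sum of squared errors (SSE) around each bucket mean, and excluded points are simply skipped (the "arbitrary items" notion). The function names `sse`, `optimal_histogram_error`, and `best_with_outliers` are hypothetical.

```python
from itertools import combinations


def sse(values):
    """Sum of squared errors of values around their mean (0 for empty input)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def optimal_histogram_error(values, buckets):
    """Classic dynamic program: minimum total SSE when splitting `values`
    into at most `buckets` contiguous buckets (assumed error measure)."""
    n = len(values)
    if n == 0:
        return 0.0
    inf = float("inf")
    # best[b][i] = minimum error covering the first i values with at most b buckets
    best = [[inf] * (n + 1) for _ in range(buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, buckets + 1):
        best[b][0] = 0.0
        for i in range(1, n + 1):
            for j in range(i):
                if best[b - 1][j] < inf:
                    # the last bucket covers values[j:i]
                    cand = best[b - 1][j] + sse(values[j:i])
                    best[b][i] = min(best[b][i], cand)
    return best[buckets][n]


def best_with_outliers(values, buckets, k):
    """Brute force: try every set of at most k excluded positions and keep
    the one whose remaining points give the lowest histogram error.
    Exponential in k, so suitable for tiny illustrative inputs only."""
    n = len(values)
    best_error, best_removed = float("inf"), ()
    for r in range(k + 1):
        for removed in combinations(range(n), r):
            skip = set(removed)
            kept = [v for i, v in enumerate(values) if i not in skip]
            err = optimal_histogram_error(kept, buckets)
            if err < best_error:
                best_error, best_removed = err, removed
    return best_error, best_removed
```

For example, `best_with_outliers([1, 1, 100, 1, 1, 5, 5, 5], buckets=2, k=1)` excludes the value 100 at position 2, after which two buckets fit the remaining points with zero error. With `k=0` the same input has a strictly positive error, which illustrates why allowing a few outliers can reduce histogram error substantially.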
Original language | English |
---|---|
Title of host publication | Advances in Database Technology - EDBT 2020 |
Subtitle of host publication | 23rd International Conference on Extending Database Technology, Proceedings |
Editors | Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Bohm, Dan Olteanu, George Fletcher, Arijit Khan, Bin Yang |
Publisher | OpenProceedings.org |
Pages | 181-192 |
Number of pages | 12 |
ISBN (Electronic) | 9783893180837 |
State | Published - 2020 |
Externally published | Yes |
Event | 23rd International Conference on Extending Database Technology, EDBT 2020 - Copenhagen, Denmark; Duration: 30 Mar 2020 → 2 Apr 2020 |
Publication series
Name | Advances in Database Technology - EDBT |
---|---|
Volume | 2020-March |
ISSN (Electronic) | 2367-2005 |
Conference
Conference | 23rd International Conference on Extending Database Technology, EDBT 2020 |
---|---|
Country/Territory | Denmark |
City | Copenhagen |
Period | 30/03/20 → 2/04/20 |
Bibliographical note
Publisher Copyright: © 2020 Copyright held by the owner/author(s).