Optimal histograms with outliers

Rachel Behar, Sara Cohen

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

1 Scopus citation

Abstract

Histograms are a simple, well-studied way to summarize data. As such, they are used extensively in applications that require estimates of data frequency values. Significant previous work has studied the problem of finding optimal histograms with respect to an error measure. In this paper we study the classic problem of finding an optimal histogram for a dataset, with a new twist: the histogram must contain at least n − k of the n data points. The k excluded data points are considered outliers. We consider two notions of excluding data items: allowing arbitrary items to be excluded, or only removing items while retaining a consistent histogram. Polynomial-time algorithms are presented for these problems. Extensive experimentation demonstrates that our algorithms work well in practice to reduce the histogram error.
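
To make the problem statement concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes the error measure is the sum of squared errors (SSE) of each bucket around its mean, that buckets are contiguous ranges of the input sequence, and that an excluded point is simply ignored when its bucket's error is computed (one possible reading of "arbitrary items excluded"). The function names `min_sse_after_drops` and `histogram_error_with_outliers` are hypothetical, and the dynamic program is a straightforward one chosen for readability rather than efficiency.

```python
from functools import lru_cache
from typing import Sequence


def min_sse_after_drops(segment: Sequence[float], drops: int) -> float:
    """Smallest SSE (around the mean) of `segment` after dropping `drops` values."""
    keep = len(segment) - drops
    if keep <= 0:
        return 0.0  # assumption: a bucket with every point excluded costs nothing
    ordered = sorted(segment)
    # For a fixed number of kept values, an SSE-minimal subset is a contiguous
    # window in sorted order, so sliding a window over the sorted values suffices.
    prefix, prefix_sq = [0.0], [0.0]
    for v in ordered:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)
    best = float("inf")
    for start in range(len(ordered) - keep + 1):
        total = prefix[start + keep] - prefix[start]
        total_sq = prefix_sq[start + keep] - prefix_sq[start]
        best = min(best, total_sq - total * total / keep)
    return max(best, 0.0)  # guard against tiny negative rounding error


def histogram_error_with_outliers(values: Sequence[float], buckets: int, k: int) -> float:
    """Minimum total SSE using `buckets` contiguous buckets and at most `k` excluded points."""
    values = tuple(values)
    n = len(values)
    if buckets < 1 or buckets > n or k < 0:
        raise ValueError("need 1 <= buckets <= len(values) and k >= 0")

    @lru_cache(maxsize=None)
    def solve(end: int, b: int, drops: int) -> float:
        # Best error for values[:end] split into b buckets with at most `drops` exclusions.
        if b == 1:
            return min_sse_after_drops(values[:end], min(drops, end))
        best = float("inf")
        for split in range(b - 1, end):  # last bucket is values[split:end]
            for d in range(min(drops, end - split) + 1):  # exclusions in last bucket
                cost = solve(split, b - 1, drops - d) + min_sse_after_drops(values[split:end], d)
                best = min(best, cost)
        return best

    return solve(n, buckets, k)
```

For example, on the sequence `[1, 1, 1, 100, 2, 2, 2]` with two buckets, `histogram_error_with_outliers(..., 2, 0)` gives 7203, while allowing a single outlier (`k = 1`) lets the value 100 be excluded and brings the error down to 0.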

Original language: American English
Title of host publication: Advances in Database Technology - EDBT 2020
Subtitle of host publication: 23rd International Conference on Extending Database Technology, Proceedings
Editors: Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Bohm, Dan Olteanu, George Fletcher, Arijit Khan, Bin Yang
Publisher: OpenProceedings.org
Pages: 181-192
Number of pages: 12
ISBN (Electronic): 9783893180837
State: Published - 2020
Externally published: Yes
Event: 23rd International Conference on Extending Database Technology, EDBT 2020 - Copenhagen, Denmark
Duration: 30 Mar 2020 to 2 Apr 2020

Publication series

Name: Advances in Database Technology - EDBT
Volume: 2020-March
ISSN (Electronic): 2367-2005

Conference

Conference: 23rd International Conference on Extending Database Technology, EDBT 2020
Country/Territory: Denmark
City: Copenhagen
Period: 30/03/20 to 2/04/20

Bibliographical note

Publisher Copyright:
© 2020 Copyright held by the owner/author(s).
