Abstract
Data repair, i.e., the identification and correction of errors in data, is a central component of the Data Science cycle. As such, significant research effort has been devoted to automating the repair process. Yet it still requires significant manual labor from Data Scientists, who tweak and optimize repair modules (up to 80% of their time, according to surveys). To reduce this burden, we propose in this paper a novel framework for explaining the results of any data repair module. Explanations involve identifying the table cells and database constraints that have the strongest influence on the process. Influence, in turn, is quantified through the game-theoretic notion of Shapley values, commonly used for explaining the results of Machine Learning classifiers. The main technical challenge is that exact computation of Shapley values takes exponential time. We consequently devise and optimize novel approximation algorithms, and analyze them both theoretically and empirically. Our results show the efficiency of our approach when compared to the alternative of adapting existing Shapley value computation techniques to the data repair setting.
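
To make the notion of influence concrete, the sketch below contrasts exact Shapley values with a generic Monte Carlo permutation-sampling estimate. It is not the approximation algorithm proposed in the paper, only a textbook baseline of the kind the abstract compares against. The toy game is an assumption made for illustration: the players are two hypothetical constraints (`c1`, `c2`) and one hypothetical table cell (`t1.city`), and `repairs_target` is an invented stand-in for "running the repair module on this subset of the input fixes the cell of interest".

```python
import random
from itertools import permutations
from math import factorial


def _add_marginals(order, value, totals):
    """Add each player's marginal contribution along one ordering."""
    coalition = set()
    prev = value(frozenset(coalition))
    for player in order:
        coalition.add(player)
        cur = value(frozenset(coalition))
        totals[player] += cur - prev
        prev = cur


def exact_shapley(players, value):
    """Exact Shapley values by enumerating every ordering.

    The number of orderings is n!, so this is usable only for tiny inputs.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        _add_marginals(order, value, totals)
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


def sampled_shapley(players, value, n_samples=2000, seed=0):
    """Estimate Shapley values from n_samples random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(n_samples):
        rng.shuffle(order)
        _add_marginals(order, value, totals)
    return {p: t / n_samples for p, t in totals.items()}


# Hypothetical players: two constraints and one table cell.
PLAYERS = ["c1", "c2", "t1.city"]


def repairs_target(coalition):
    """Toy value function (an assumption, not a real repair module):
    the target cell is repaired only if constraint c1 is present
    together with at least one of c2 or the cell t1.city."""
    return 1.0 if "c1" in coalition and ("c2" in coalition or "t1.city" in coalition) else 0.0


exact = exact_shapley(PLAYERS, repairs_target)
approx = sampled_shapley(PLAYERS, repairs_target)
```

In this toy game `c1` is necessary for the repair, so it receives the largest share of the influence, while `c2` and `t1.city` are interchangeable and split the remainder equally. The sampled estimate trades exactness for a cost that grows with the number of samples rather than with the number of orderings.
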
Original language | English |
---|---|
Title of host publication | CIKM 2021 - Proceedings of the 30th ACM International Conference on Information and Knowledge Management |
Publisher | Association for Computing Machinery |
Pages | 362-371 |
Number of pages | 10 |
ISBN (Electronic) | 9781450384469 |
State | Published - 26 Oct 2021 |
Externally published | Yes |
Event | 30th ACM International Conference on Information and Knowledge Management, CIKM 2021 - Virtual, Online, Australia. Duration: 1 Nov 2021 → 5 Nov 2021 |
Publication series
Name | International Conference on Information and Knowledge Management, Proceedings |
---|---|
Conference
Conference | 30th ACM International Conference on Information and Knowledge Management, CIKM 2021 |
---|---|
Country/Territory | Australia |
City | Virtual, Online |
Period | 1/11/21 → 5/11/21 |
Bibliographical note
Publisher Copyright: © 2021 ACM.
Keywords
- data repair
- denial constraints
- explainability
- Shapley value