Data Contamination: From Memorization to Exploitation

Inbal Magar, Roy Schwartz

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

13 Scopus citations


Pretrained language models are typically trained on massive web-based datasets, which are often “contaminated” with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify levels of memorization and exploitation. Experiments with two models and three downstream tasks show that exploitation exists in some cases, but in others the models memorize the contaminated data, but do not exploit it. We show that these two measures are affected by different factors such as the number of duplications of the contaminated data and the model size. Our results highlight the importance of analyzing massive web-scale datasets to verify that progress in NLP is obtained by better language understanding and not better data exploitation.

Original languageAmerican English
Title of host publicationACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
EditorsSmaranda Muresan, Preslav Nakov, Aline Villavicencio
PublisherAssociation for Computational Linguistics (ACL)
Number of pages9
ISBN (Electronic)9781955917223
StatePublished - 2022
Event60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 - Dublin, Ireland
Duration: 22 May 202227 May 2022

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print)0736-587X


Conference60th Annual Meeting of the Association for Computational Linguistics, ACL 2022

Bibliographical note

Publisher Copyright:
© 2022 Association for Computational Linguistics.


Dive into the research topics of 'Data Contamination: From Memorization to Exploitation'. Together they form a unique fingerprint.

Cite this