Given a partial description like “she opened the hood of the car,” humans can reason about the situation and anticipate what might come next (“then, she examined the engine”). In this paper, we introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning. We present Swag, a new dataset with 113k multiple-choice questions about a rich spectrum of grounded situations. To address the recurring challenges of annotation artifacts and human biases found in many existing datasets, we propose Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data. To account for the aggressive adversarial filtering, we use state-of-the-art language models to massively oversample a diverse set of potential counterfactuals. Empirical results demonstrate that while humans can solve the resulting inference problems with high accuracy (88%), various competitive models struggle on our task. We provide comprehensive analysis that indicates significant opportunities for future research.
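The filtering loop described in the abstract can be sketched in a toy form: repeatedly find the distractor ending that a stylistic classifier scores as easiest to detect, and swap it for a harder candidate from the oversampled pool. This is an illustrative simplification under stated assumptions, not the paper's implementation: `style_score` here is a hypothetical length heuristic standing in for the paper's trained classifier ensemble, and the `distractors` field name is invented for the example.

```python
import random

# Hypothetical stand-in for the paper's stylistic classifier ensemble:
# returns how "machine-like" an ending looks (here, just its length).
def style_score(ending):
    return len(ending)

def adversarial_filter(dataset, candidate_pool, rounds=3):
    """Simplified AF loop: each round, for every question, replace the
    distractor the classifier finds easiest (highest score) with a
    harder candidate (lower score) drawn from the oversampled pool."""
    for _ in range(rounds):
        for item in dataset:
            # "Easiest" distractor = the one the classifier detects best.
            easiest = max(item["distractors"], key=style_score)
            # Candidates that look less machine-like than the easiest one.
            harder = [c for c in candidate_pool
                      if style_score(c) < style_score(easiest)]
            if harder:
                item["distractors"].remove(easiest)
                item["distractors"].append(random.choice(harder))
    return dataset
```

Each swap strictly lowers the classifier's score for the replaced ending, so over successive rounds the remaining distractors become harder for the ensemble (and, ideally, for models relying on the same stylistic cues) to tell apart from the gold ending.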
Original language: American English
Title of host publication: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
Editors: Ellen Riloff, David Chiang, Julia Hockenmaier, Jun'ichi Tsujii
Publisher: Association for Computational Linguistics
Number of pages: 12
State: Published - 2018
Event: 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 - Brussels, Belgium
Duration: 31 Oct 2018 → 4 Nov 2018
Bibliographical note
Funding Information:
We thank the anonymous reviewers, members of the ARK and xlab at the University of Washington, researchers at the Allen Institute for AI, and Luke Zettlemoyer for their helpful feedback. We also thank the Mechanical Turk workers for doing a fantastic job with the human validation. This work was supported by the National Science Foundation Graduate Research Fellowship (DGE-1256082), NSF grants (IIS-1524371 and IIS-1703166), the DARPA CwC program through ARO (W911NF-15-1-0543), the IARPA DIVA program through D17PC00343, and gifts from Google and Facebook. The views and conclusions contained herein are those of the authors and should not be interpreted as representing endorsements of IARPA, DOI/IBC, or the U.S. Government.
© 2018 Association for Computational Linguistics