Evaluating question answering evaluation

Anthony Chen*, Gabriel Stanovsky, Sameer Singh, Matt Gardner

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

64 Scopus citations

Abstract

As the complexity of question answering (QA) datasets evolves, moving away from restricted formats like span extraction and multiple-choice (MC) toward free-form answer generation, it is imperative to understand how well current metrics perform in evaluating QA. This is especially important as existing metrics (BLEU, ROUGE, METEOR, and F1) are computed using n-gram similarity and have a number of well-known drawbacks. In this work, we study the suitability of existing metrics in QA. For generative QA, we show that while current metrics do well on existing datasets, converting multiple-choice datasets into free-response datasets is challenging for current metrics. We also examine span-based QA, where F1 is a reasonable metric, and show that F1 may not be suitable for all extractive QA tasks depending on the answer types. Our study suggests that while current metrics may be suitable for existing QA datasets, they limit the complexity of QA datasets that can be created. This is especially true in the context of free-form QA, where we would like our models to generate more complex and abstractive answers, necessitating new metrics that go beyond n-gram-based matching. As a step towards a better QA metric, we explore using BERTScore, a recently proposed metric for evaluating translation, for QA. We find that although it fails to provide stronger correlation with human judgements, future work focused on tailoring a BERT-based metric to QA evaluation may prove fruitful.
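For context, the token-overlap F1 referenced in the abstract is the standard SQuAD-style answer metric: the predicted and gold answer strings are normalized, and precision and recall are computed over their bags of tokens. The sketch below is an illustrative Python re-implementation, not code from the paper; the function names (normalize, token_f1) and the exact normalization steps are assumptions that mirror the commonly used SQuAD evaluation script.

```python
import collections
import re
import string


def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style).
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, gold):
    # Token-overlap F1 between a predicted answer string and a gold answer string.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # precision 0.5, recall 1.0, F1 ≈ 0.67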

Original language: English
Title of host publication: MRQA@EMNLP 2019 - Proceedings of the 2nd Workshop on Machine Reading for Question Answering
Publisher: Association for Computational Linguistics (ACL)
Pages: 119-124
Number of pages: 6
ISBN (Electronic): 9781950737819
State: Published - 2019
Externally published: Yes
Event: 2nd Workshop on Machine Reading for Question Answering, MRQA@EMNLP 2019 - Hong Kong, China
Duration: 4 Nov 2019 → …

Publication series

Name: MRQA@EMNLP 2019 - Proceedings of the 2nd Workshop on Machine Reading for Question Answering

Conference

Conference: 2nd Workshop on Machine Reading for Question Answering, MRQA@EMNLP 2019
Country/Territory: China
City: Hong Kong
Period: 4/11/19 → …

Bibliographical note

Publisher Copyright:
© 2019 MRQA@EMNLP 2019 - Proceedings of the 2nd Workshop on Machine Reading for Question Answering. All rights reserved.
