We point out that common evaluation practices for cross-document coreference resolution have been unrealistically permissive in their assumed settings, yielding inflated results. We propose addressing this issue via two evaluation methodology principles. First, as in other tasks, models should be evaluated on predicted mentions rather than on gold mentions. Doing this raises a subtle issue regarding singleton coreference clusters, which we address by decoupling the evaluation of mention detection from that of coreference linking. Second, we argue that models should not exploit the synthetic topic structure of the standard ECB+ dataset, forcing models to confront the lexical ambiguity challenge, as intended by the dataset creators. We demonstrate empirically the drastic impact of our more realistic evaluation principles on a competitive model, yielding a score which is 33 F1 lower compared to evaluating by prior lenient practices.
|Original language||American English|
|Title of host publication||*SEM 2021 - 10th Conference on Lexical and Computational Semantics, Proceedings of the Conference|
|Editors||Lun-Wei Ku, Vivi Nastase, Ivan Vulić|
|Publisher||Association for Computational Linguistics (ACL)|
|Number of pages||9|
|State||Published - 2021|
|Event||10th Conference on Lexical and Computational Semantics, *SEM 2021 - Virtual, Bangkok, Thailand|
Duration: 5 Aug 2021 → 6 Aug 2021
|Name||*SEM 2021 - 10th Conference on Lexical and Computational Semantics, Proceedings of the Conference|
|Conference||10th Conference on Lexical and Computational Semantics, *SEM 2021|
|Period||5/08/21 → 6/08/21|
Bibliographical note: Funding Information:
We thank Shany Barhom for fruitful discussion and sharing code, and Yehudit Meged for providing her coreference predictions. The work described herein was supported in part by grants from Intel Labs, Facebook, the Israel Science Foundation grant 1951/17, the Israeli Ministry of Science and Technology, the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1), and from the Allen Institute for AI.
© 2021 Lexical and Computational Semantics