While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game of vision-and-language associations (e.g., between werewolves and a full moon), used as a dynamic evaluation benchmark. Inspired by the popular card game Codenames, a “spymaster” gives a textual cue related to several visual candidates, and another player tries to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code and the interactive game, allowing future data collection that can be used to develop models with better association abilities.
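The abstract reports both human and model performance as a Jaccard index. As a minimal sketch, assuming a guess is scored by comparing the set of candidates a solver selects against the set the spymaster associated with the cue, the measure can be computed as follows. The function name and the example candidate labels are hypothetical illustrations, not taken from the released code.

```python
def jaccard_index(selected, gold):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of candidates.

    `selected` is assumed to be what the solver picked and `gold` what the
    spymaster intended; both are treated as sets (hypothetical naming).
    """
    selected, gold = set(selected), set(gold)
    union = selected | gold
    if not union:
        # Two empty selections are treated as identical by convention.
        return 1.0
    return len(selected & gold) / len(union)


# Illustrative cue "werewolf": the solver gets one of two associations right.
gold = {"full moon", "wolf"}
selected = {"full moon", "dog"}
score = jaccard_index(selected, gold)  # 1 shared item out of 3 distinct items
```

A perfect guess scores 1.0, and a guess sharing nothing with the intended set scores 0.0.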
|Original language||American English|
|Title of host publication||Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022|
|Editors||S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh|
|Publisher||Neural Information Processing Systems Foundation|
|State||Published - 2022|
|Event||36th Conference on Neural Information Processing Systems, NeurIPS 2022 - New Orleans, United States (Duration: 28 Nov 2022 → 9 Dec 2022)|
|Name||Advances in Neural Information Processing Systems|
Bibliographical note
Funding Information:
We would like to thank Moran Mizrahi for feedback regarding the player survey. We would also like to thank Jaemin Cho, Tom Hope, Yonatan Belinkov, Inbal Magar and Aviv Shamsian. This work was supported in part by the Center for Interdisciplinary Data Science Research at the Hebrew University of Jerusalem, and by research grant no. 2088 from the Israeli Ministry of Science and Technology.
© 2022 Neural Information Processing Systems Foundation. All rights reserved.