TY - JOUR
T1 - State of What Art? A Call for Multi-Prompt LLM Evaluation
AU - Mizrahi, Moran
AU - Kaplan, Guy
AU - Malkin, Dan
AU - Dror, Rotem
AU - Shahaf, Dafna
AU - Stanovsky, Gabriel
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both absolute and relative. Instead, we propose a set of diverse metrics on multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs.
AB - Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both absolute and relative. Instead, we propose a set of diverse metrics on multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs.
UR - http://www.scopus.com/inward/record.url?scp=85201600863&partnerID=8YFLogxK
U2 - 10.1162/tacl_a_00681
DO - 10.1162/tacl_a_00681
M3 - Article
AN - SCOPUS:85201600863
SN - 2307-387X
VL - 12
SP - 933
EP - 949
JO - Transactions of the Association for Computational Linguistics
JF - Transactions of the Association for Computational Linguistics
ER -