Abstract
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest RELIABLEEVAL – a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity.
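To make the idea concrete, here is a minimal sketch of how the first two moments of a model's score across prompt paraphrases could be estimated, and how they could be turned into a rough count of required resamplings. It is not the RELIABLEEVAL procedure itself, whose details are not given in the abstract. The function names, the normal-approximation stopping rule, and the pilot scores are all illustrative assumptions.

```python
import math
import statistics


def moment_estimates(scores):
    """Return the sample mean and unbiased sample variance of per-paraphrase scores."""
    mean = statistics.fmean(scores)
    var = statistics.variance(scores) if len(scores) > 1 else 0.0
    return mean, var


def resamplings_needed(scores, half_width, z=1.96):
    """Smallest n with z * sqrt(var / n) <= half_width.

    Assumption: a normal approximation for the mean over paraphrases;
    the paper may use a different criterion.
    """
    _, var = moment_estimates(scores)
    if var == 0.0:
        return 1
    return max(1, math.ceil(z * z * var / (half_width ** 2)))


if __name__ == "__main__":
    # Hypothetical pilot run: benchmark accuracy under five paraphrased prompts.
    pilot_scores = [0.71, 0.64, 0.69, 0.75, 0.66]
    mean, var = moment_estimates(pilot_scores)
    n = resamplings_needed(pilot_scores, half_width=0.02)
    print(f"mean={mean:.3f} variance={var:.5f} resamplings={n}")
```

Under these assumptions, a model whose scores vary more across paraphrases needs more resamplings before its reported mean is stable to the chosen half-width.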
| Original language | English |
|---|---|
| Title of host publication | EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025 |
| Editors | Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 11146-11153 |
| Number of pages | 8 |
| ISBN (Electronic) | 9798891763357 |
| State | Published - 2025 |
| Event | 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 - Suzhou, China. Duration: 4 Nov 2025 → 9 Nov 2025 |
Publication series
| Name | EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025 |
|---|---|
Conference
| Conference | 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 |
|---|---|
| Country/Territory | China |
| City | Suzhou |
| Period | 4 Nov 2025 → 9 Nov 2025 |
Bibliographical note
Publisher Copyright: © 2025 Association for Computational Linguistics.
Paper title: 'RELIABLEEVAL: A Recipe for Stochastic LLM Evaluation via Method of Moments'