Navigating the Modern Evaluation Landscape: Considerations in Benchmarks and Frameworks for Large Language Models (LLMs)

Leshem Choshen*, Ariel Gera, Yotam Perlitz, Michal Shmueli-Scheuer, Gabriel Stanovsky

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

General-purpose language models have changed the world of natural language processing, if not the world itself. The evaluation of such versatile models, while supposedly similar to the evaluation of generation models before them, in fact presents a host of new challenges and opportunities. This tutorial welcomes people from diverse backgrounds and assumes little familiarity with metrics, datasets, prompts, and benchmarks. It will lay the foundations and explain the basics and their importance, while touching on the major points and breakthroughs of the recent era of evaluation. We will contrast new approaches with old ones, from evaluating on multi-task benchmarks rather than on dedicated datasets to efficiency constraints, and from testing the stability of prompts in in-context learning to using the models themselves as evaluation metrics. Finally, we will present a host of open research questions in the field of robust, efficient, and reliable evaluation.

Original language: English
Title of host publication: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Tutorial Summaries
Editors: Roman Klinger, Naoaki Okazaki
Publisher: European Language Resources Association (ELRA)
Pages: 19-25
Number of pages: 7
ISBN (Electronic): 9782493814357
State: Published - 2024
Event: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Torino, Italy
Duration: 20 May 2024 - 25 May 2024

Publication series

Name: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Tutorial Summaries

Conference

Conference: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024
Country/Territory: Italy
City: Torino
Period: 20/05/24 - 25/05/24

Bibliographical note

Publisher Copyright:
© 2024 ELRA Language Resource Association.

Keywords

  • Benchmarks
  • Language models
  • Efficient evaluation
  • Language models as metrics
