The State and Fate of Summarization Datasets: A Survey

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

2 Scopus citations

Abstract

Automatic summarization has consistently attracted attention due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed, and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods and distribution. With this ontology we make key observations, including the lack of accessible high-quality datasets for low-resource languages, and the field's over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact with and explore our ontology and dataset collection, as well as a template for a summarization data card, which can be used to streamline future research into a more coherent body of work.

Original language: English
Title of host publication: Long Papers
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Publisher: Association for Computational Linguistics (ACL)
Pages: 7259-7278
Number of pages: 20
ISBN (Electronic): 9798891761896
DOIs
State: Published - 2025
Event: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2025 - Hybrid, Albuquerque, United States
Duration: 29 Apr 2025 to 4 May 2025

Publication series

Name: Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025
Volume: 1

Conference

Conference: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2025
Country/Territory: United States
City: Hybrid, Albuquerque
Period: 29/04/25 to 4/05/25

Bibliographical note

Publisher Copyright:
© 2025 Association for Computational Linguistics.
