HEBDB: a Weakly Supervised Dataset for Hebrew Speech Processing

Arnon Turetzky, Or Tal, Yael Segal-Feldman, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya R. Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi

Research output: Contribution to journal, Conference article, peer-review

Abstract

We present HEBDB, a weakly supervised dataset for spoken language processing in Hebrew. HEBDB offers roughly 2500 hours of natural and spontaneous Hebrew speech recordings, covering a large variety of speakers and topics. We provide the raw recordings together with a pre-processed, weakly supervised, and filtered version. The goal of HEBDB is to advance research and development of spoken language processing tools for Hebrew. To that end, we additionally provide two baseline systems for Automatic Speech Recognition (ASR): (i) a self-supervised model; and (ii) a fully supervised model. We report the performance of these two methods optimized on HEBDB and compare them to current multi-lingual ASR alternatives. Results suggest that the proposed method outperforms the evaluated baselines at similar model sizes. Dataset, code, and models are publicly available at https://pages.cs.huji.ac.il/adiyoss-lab/HebDB/.

Original language: English
Pages (from-to): 1360-1364
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 to 5 Sep 2024

Bibliographical note

Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.

Keywords

  • Automatic Speech Recognition
  • Hebrew Speech Technologies
  • Speech Benchmark
