
On generative spoken language modeling from raw audio

  • Kushal Lakhotia
  • Eugene Kharitonov
  • Wei Ning Hsu
  • Yossi Adi
  • Adam Polyak
  • Benjamin Bolte
  • Tu Anh Nguyen
  • Jade Copet
  • Alexei Baevski
  • Abdelrahman Mohamed
  • Emmanuel Dupoux

Research output: Contribution to journal › Article › peer-review

278 Scopus citations

Abstract

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
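
To make the "pseudo-text" idea in the abstract concrete, the following is a minimal sketch of how continuous frame features might be turned into a sequence of discrete units. It is not the authors' implementation: the function names `quantize` and `collapse_repeats` are hypothetical, nearest-centroid assignment is an assumed quantization scheme, collapsing consecutive repeated units is an assumed preprocessing step, and the 2-dimensional frames and 3 centroids are toy values standing in for real encoder features and a 50, 100, or 200-unit codebook.

```python
from itertools import groupby


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Uses squared Euclidean distance. Illustrative assumption only:
    the abstract does not specify how units are obtained.
    """
    units = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(distances.index(min(distances)))
    return units


def collapse_repeats(units):
    """Merge runs of identical consecutive units into a single unit (assumed step)."""
    return [unit for unit, _ in groupby(units)]


# Toy codebook with 3 units (a real system would use 50, 100, or 200).
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Toy "encoder output": one 2-dimensional feature vector per audio frame.
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.1), (0.1, 0.9), (0.0, 1.0)]

units = quantize(frames, centroids)
pseudo_text = collapse_repeats(units)
print(units)        # [0, 0, 1, 2, 2]
print(pseudo_text)  # [0, 1, 2]
```

In a pipeline of the kind the abstract describes, a sequence like `pseudo_text` would be what the generative language model is trained on and what the speech decoder converts back into a waveform.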

Original language: English
Pages (from-to): 1336-1354
Number of pages: 19
Journal: Transactions of the Association for Computational Linguistics
Volume: 9
State: Published - 6 Dec 2021
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
