Abstract
We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
| Original language | English |
|---|---|
| Pages (from-to) | 1336-1354 |
| Number of pages | 19 |
| Journal | Transactions of the Association for Computational Linguistics |
| Volume | 9 |
| State | Published - 6 Dec 2021 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.