TY - JOUR
T1 - On generative spoken language modeling from raw audio
AU - Lakhotia, Kushal
AU - Kharitonov, Eugene
AU - Hsu, Wei-Ning
AU - Adi, Yossi
AU - Polyak, Adam
AU - Bolte, Benjamin
AU - Nguyen, Tu Anh
AU - Copet, Jade
AU - Baevski, Alexei
AU - Mohamed, Abdelrahman
AU - Dupoux, Emmanuel
N1 - Publisher Copyright: © 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
PY - 2021/12/6
Y1 - 2021/12/6
AB - We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
UR - http://www.scopus.com/inward/record.url?scp=85121118256&partnerID=8YFLogxK
U2 - 10.1162/tacl_a_00430
DO - 10.1162/tacl_a_00430
M3 - Article
AN - SCOPUS:85121118256
SN - 2307-387X
VL - 9
SP - 1336
EP - 1354
JO - Transactions of the Association for Computational Linguistics
JF - Transactions of the Association for Computational Linguistics
ER -