TY - JOUR
T1 - Generative Spoken Dialogue Language Modeling
AU - Nguyen, Tu Anh
AU - Kharitonov, Eugene
AU - Copet, Jade
AU - Adi, Yossi
AU - Hsu, Wei-Ning
AU - Elkahky, Ali
AU - Tomasello, Paden
AU - Algayres, Robin
AU - Sagot, Benoît
AU - Mohamed, Abdelrahman
AU - Dupoux, Emmanuel
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023/3/14
Y1 - 2023/3/14
N2 - We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.
AB - We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.
UR - http://www.scopus.com/inward/record.url?scp=85150992661&partnerID=8YFLogxK
U2 - 10.1162/tacl_a_00545
DO - 10.1162/tacl_a_00545
M3 - Article
AN - SCOPUS:85150992661
SN - 2307-387X
VL - 11
SP - 250
EP - 266
JO - Transactions of the Association for Computational Linguistics
JF - Transactions of the Association for Computational Linguistics
ER -