Direct Speech-to-Speech Translation With Discrete Units

Ann Lee, Peng Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei Ning Hsu

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

39 Scopus citations

Abstract

We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.

Original languageAmerican English
Title of host publicationACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
EditorsSmaranda Muresan, Preslav Nakov, Aline Villavicencio
PublisherAssociation for Computational Linguistics (ACL)
Pages3327-3339
Number of pages13
ISBN (Electronic)9781955917216
StatePublished - 2022
Externally publishedYes
Event60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 - Dublin, Ireland
Duration: 22 May 202227 May 2022

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
Volume1
ISSN (Print)0736-587X

Conference

Conference60th Annual Meeting of the Association for Computational Linguistics, ACL 2022
Country/TerritoryIreland
CityDublin
Period22/05/2227/05/22

Bibliographical note

Publisher Copyright:
© 2022 Association for Computational Linguistics.

Fingerprint

Dive into the research topics of 'Direct Speech-to-Speech Translation With Discrete Units'. Together they form a unique fingerprint.

Cite this