Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee

Research output: Contribution to journal › Conference article › peer-review

Abstract

Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue. We take advantage of a recently proposed speech-to-unit translation (S2UT) framework that encodes target speech into discrete representations, and transfer pre-training and efficient partial finetuning techniques that work well for speech-to-text translation (S2T) to the S2UT domain by studying both speech encoder and discrete unit decoder pre-training. Our experiments on Spanish-English translation show that self-supervised pre-training consistently improves model performance compared with multitask learning with an average 6.6-12.1 BLEU gain, and it can be further combined with data augmentation techniques that apply MT to create weakly supervised training data.
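
To make the S2UT setup described above concrete, the sketch below wires a toy speech encoder (standing in for a self-supervised pre-trained model such as a wav2vec 2.0-style encoder) to an autoregressive decoder over discrete target-speech units, and shows one way partial finetuning can be expressed by freezing most parameters. This is an illustrative sketch, not the authors' implementation: the module sizes, the 100-unit vocabulary, and the freeze_for_partial_finetuning helper (which keeps only attention and layer-norm parameters trainable, loosely in the spirit of the efficient partial finetuning borrowed from S2T work) are all assumptions.

```python
# Minimal, self-contained sketch of a speech-to-unit translation (S2UT) model.
# Hypothetical illustration only; sizes and the freezing rule are assumptions.
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    """Stand-in for a self-supervised pre-trained speech encoder."""
    def __init__(self, d_model=256):
        super().__init__()
        # Toy feature extractor: one strided conv over the raw waveform.
        self.subsample = nn.Conv1d(1, d_model, kernel_size=10, stride=5)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, wav):                      # wav: (batch, samples)
        x = self.subsample(wav.unsqueeze(1))     # (batch, d_model, frames)
        x = x.transpose(1, 2)                    # (batch, frames, d_model)
        return self.transformer(x)


class UnitDecoder(nn.Module):
    """Autoregressive decoder over discrete target-speech units."""
    def __init__(self, num_units=100, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(num_units, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerDecoder(layer, num_layers=2)
        self.proj = nn.Linear(d_model, num_units)

    def forward(self, prev_units, encoder_out):  # prev_units: (batch, units)
        y = self.embed(prev_units)
        u = prev_units.size(1)
        # Additive causal mask: -inf above the diagonal blocks future units.
        causal = torch.triu(torch.full((u, u), float("-inf")), diagonal=1)
        y = self.transformer(y, encoder_out, tgt_mask=causal)
        return self.proj(y)                      # logits over the unit vocabulary


class S2UTModel(nn.Module):
    """Source speech in, discrete unit logits out; a unit vocoder would synthesize the target waveform."""
    def __init__(self):
        super().__init__()
        self.encoder = SpeechEncoder()
        self.decoder = UnitDecoder()

    def forward(self, wav, prev_units):
        return self.decoder(prev_units, self.encoder(wav))


def freeze_for_partial_finetuning(model, trainable_keywords=("self_attn", "norm")):
    # Hypothetical helper: freeze every parameter whose name does not match one
    # of the keywords, so only attention and layer-norm weights stay trainable.
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)


if __name__ == "__main__":
    model = S2UTModel()
    freeze_for_partial_finetuning(model)
    wav = torch.randn(2, 16000)                  # two 1-second clips of 16 kHz source speech
    prev_units = torch.randint(0, 100, (2, 50))  # shifted target unit sequence (teacher forcing)
    logits = model(wav, prev_units)
    print(logits.shape)                          # torch.Size([2, 50, 100])
```

In this framing, the weakly supervised data mentioned in the abstract would simply be additional (source speech, target unit) pairs, obtained by applying MT to existing speech-text corpora, fed through the same training loop.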

Original language: English
Pages (from-to): 5195-5199
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
DOIs
State: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Republic of Korea
Duration: 18 Sep 2022 - 22 Sep 2022

Bibliographical note

Publisher Copyright:
Copyright © 2022 ISCA.

Keywords

  • data augmentation
  • self-supervised pre-training
  • speech-to-speech translation
