Abstract
The research community has produced many successful self-supervised speech representation learning methods over the past few years. Discrete units have been utilized in various self-supervised learning frameworks, such as VQ-VAE [1], wav2vec 2.0 [2], HuBERT [3], and Wav2Seq [4]. This paper studies the impact of altering the granularity and improving the quality of these discrete acoustic units for pre-training encoder-only and encoder-decoder models. We systematically study the current proposals of using Byte-Pair Encoding (BPE) and new extensions that use cluster smoothing and Brown clustering. The quality of the learned units is studied intrinsically using ZeroSpeech metrics and extrinsically on the downstream automatic speech recognition (ASR) task. Our results suggest that longer-range units are helpful for encoder-decoder pre-training; however, encoder-only masked-prediction models cannot yet benefit from self-supervised word-like targets.
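The BPE approach mentioned above can be illustrated with a minimal sketch: starting from sequences of discrete cluster IDs (e.g. HuBERT units), the most frequent adjacent pair is repeatedly merged into a new, longer-range unit. This is an illustrative toy implementation, not the paper's actual pipeline; the function names, merge count, and vocabulary size are assumptions for demonstration.

```python
# Hypothetical sketch of BPE over discrete acoustic unit sequences.
# Units are integer cluster IDs; merges create new, coarser-grained units.
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most frequent adjacent unit pair across all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_unit):
    """Replace every non-overlapping occurrence of `pair` with `new_unit`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_unit)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(sequences, num_merges, vocab_size):
    """Greedily learn `num_merges` BPE merges; new unit IDs start at `vocab_size`."""
    next_unit = vocab_size
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        sequences = [merge_pair(s, pair, next_unit) for s in sequences]
        merges.append((pair, next_unit))
        next_unit += 1
    return sequences, merges

# Toy example: base vocabulary of 4 cluster IDs, two merge steps.
seqs = [[0, 1, 2, 0, 1, 3], [0, 1, 0, 1, 2]]
merged, merges = learn_bpe(seqs, num_merges=2, vocab_size=4)
```

Each merge shortens the sequences and enlarges the unit vocabulary, which is one way to control the granularity trade-off the paper investigates.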
Original language | English |
---|---|
Title of host publication | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
ISBN (Electronic) | 9781728163277 |
DOIs | |
State | Published - 2023 |
Externally published | Yes |
Event | 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece. Duration: 4 Jun 2023 → 10 Jun 2023 |
Publication series
Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
---|---|
ISSN (Print) | 1520-6149 |
Conference
Conference | 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 |
---|---|
Country/Territory | Greece |
City | Rhodes Island |
Period | 4/06/23 → 10/06/23 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
Keywords
- representation learning
- self-supervision
- unit discovery