Abstract
We present SALAD, a zero-shot text-to-speech (TTS) autoregressive model operating over continuous speech representations. SALAD utilizes a per-token diffusion process to refine and predict continuous representations for the next time step. We compare our approach against a discrete variant of SALAD as well as publicly available zero-shot TTS systems, and conduct a comprehensive analysis of discrete versus continuous modeling techniques. Our results show that SALAD achieves superior intelligibility while matching the speech quality and speaker similarity of ground-truth audio.
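The per-token diffusion idea in the abstract can be illustrated with a minimal sketch: an autoregressive loop in which each next continuous frame is drawn by running a short reverse-diffusion chain conditioned on a summary of the frames generated so far. Everything here is an assumption made for illustration and is not taken from the paper: the names `toy_denoiser`, `sample_next_latent`, and `generate`, the linear noise schedule, the DDPM-style update, the random untrained weights, and the running-mean context that stands in for a real sequence model. Text and speaker-prompt conditioning are omitted entirely.

```python
import numpy as np

def make_schedule(num_steps):
    """Linear beta schedule plus the cumulative products used by DDPM-style sampling."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t_frac, context, weights):
    """Hypothetical stand-in for a trained noise-prediction head (untrained, random weights)."""
    features = np.concatenate([x_t, context, [t_frac]])
    return np.tanh(weights @ features)

def sample_next_latent(context, weights, schedule, rng):
    """Run one reverse-diffusion chain to draw a single continuous latent vector."""
    betas, alphas, alpha_bars = schedule
    dim = weights.shape[0]
    x = rng.standard_normal(dim)  # each token starts from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = toy_denoiser(x, t / len(betas), context, weights)
        # Posterior mean of the DDPM reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(dim) if t > 0 else np.zeros(dim)
        x = mean + np.sqrt(betas[t]) * noise
    return x

def generate(num_frames, latent_dim=4, num_steps=20, seed=0):
    """Autoregressively generate frames, one diffusion chain per time step."""
    rng = np.random.default_rng(seed)
    schedule = make_schedule(num_steps)
    weights = 0.1 * rng.standard_normal((latent_dim, 2 * latent_dim + 1))
    frames = []
    context = np.zeros(latent_dim)
    for _ in range(num_frames):
        z = sample_next_latent(context, weights, schedule, rng)
        frames.append(z)
        # Toy context: running mean of past frames (assumed stand-in for a sequence model).
        context = np.mean(frames, axis=0)
    return np.stack(frames)

latents = generate(num_frames=5)
print(latents.shape)
```

Because the denoiser is untrained, the output carries no acoustic meaning; the sketch only shows the control flow, in which no quantization step appears between the sequence model and the predicted frame.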
| Original language | English |
|---|---|
| Title of host publication | ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| ISBN (Electronic) | 9798331544263 |
| DOIs | |
| State | Published - 2025 |
| Event | 2025 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025, Honolulu, United States. Duration: 6 Dec 2025 → 10 Dec 2025 |
Publication series
| Name | ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop |
|---|---|
Conference
| Conference | 2025 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025 |
|---|---|
| Country/Territory | United States |
| City | Honolulu |
| Period | 6 Dec 2025 → 10 Dec 2025 |
Bibliographical note
Publisher Copyright: © 2025 IEEE.
Paper title
Speech Synthesis From Continuous Features Using Per-Token Latent Diffusion