
Speech Synthesis From Continuous Features Using Per-Token Latent Diffusion

  • Arnon Turetzky*
  • Avihu Dekel
  • Nimrod Shabtay
  • Slava Shechtman
  • David Haws
  • Hagai Aronowitz
  • Ron Hoory
  • Yossi Adi

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

We present SALAD, a zero-shot text-to-speech (TTS) autoregressive model operating over continuous speech representations. SALAD utilizes a per-token diffusion process to refine and predict continuous representations for the next time step. We compare our approach against a discrete variant of SALAD as well as publicly available zero-shot TTS systems, and conduct a comprehensive analysis of discrete versus continuous modeling techniques. Our results show that SALAD achieves superior intelligibility while matching the speech quality and speaker similarity of ground-truth audio.

Original language: English
Title of host publication: ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331544263
State: Published - 2025
Event: 2025 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025 - Honolulu, United States
Duration: 6 Dec 2025 - 10 Dec 2025

Publication series

Name: ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop

Conference

Conference: 2025 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025
Country/Territory: United States
City: Honolulu
Period: 6/12/25 - 10/12/25

Bibliographical note

Publisher Copyright:
© 2025 IEEE.
