Real time speech enhancement in the waveform domain

Alexandre Défossez, Gabriel Synnaeve, Yossi Adi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

287 Scopus citations

Abstract

We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard benchmarks, both using objective metrics and human judgements. The proposed model matches state-of-the-art performance of both causal and non causal methods while working directly on the raw waveform.
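The abstract states that the model is optimized in both the time and frequency domains with multiple loss functions, but does not name the individual terms. As a minimal sketch of that idea, the following combines a sample-wise L1 term with an L1 term on log-magnitude spectra. The choice of these two terms, the frame size, the hop, the weighting, and the function names `magnitude_frames` and `combined_loss` are all assumptions made for illustration, not details taken from this record. A direct DFT is used to keep the example dependency-free; a real implementation would use an FFT.

```python
import cmath
import math


def magnitude_frames(signal, frame_size=64, hop=16):
    """Magnitude spectra of Hann-windowed frames, via a direct (slow) DFT.

    The signal must contain at least `frame_size` samples.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    spectra = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        # Keep only the non-negative frequency bins of a real signal.
        spectra.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size)))
            for k in range(frame_size // 2 + 1)
        ])
    return spectra


def combined_loss(estimate, clean, spectral_weight=1.0, eps=1e-8):
    """Time-domain L1 term plus a weighted log-magnitude spectral L1 term.

    Both terms are illustrative assumptions, not the published loss.
    """
    # Time-domain term: mean absolute error between waveform samples.
    time_term = sum(abs(e - c) for e, c in zip(estimate, clean)) / len(clean)

    # Frequency-domain term: mean absolute difference of log magnitudes.
    est_spec = magnitude_frames(estimate)
    ref_spec = magnitude_frames(clean)
    diffs = [abs(math.log(a + eps) - math.log(b + eps))
             for est_row, ref_row in zip(est_spec, ref_spec)
             for a, b in zip(est_row, ref_row)]
    spectral_term = sum(diffs) / len(diffs)

    return time_term + spectral_weight * spectral_term
```

Calling `combined_loss(estimate, clean)` on two equal-length lists of samples returns a single non-negative number that is zero when the two waveforms are identical. The spectral term penalizes errors that are small sample by sample but audible as spectral coloration, which is one common motivation for pairing the two domains.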

Original language: English
Title of host publication: Interspeech 2020
Publisher: International Speech Communication Association
Pages: 3291-3295
Number of pages: 5
ISBN (Print): 9781713820697
State: Published - 2020
Externally published: Yes
Event: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: 25 Oct 2020 - 29 Oct 2020

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2020-October
ISSN (Print): 2308-457X
ISSN (Electronic): 1990-9772

Conference

Conference: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory: China
City: Shanghai
Period: 25/10/20 - 29/10/20

Bibliographical note

Publisher Copyright:
© 2020 ISCA

Keywords

  • Neural networks
  • Raw waveform
  • Speech denoising
  • Speech enhancement
