Seeing through noise: Visually driven speaker separation and enhancement

Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

60 Scopus citations

Abstract

Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge, and would inevitably bias the trained model. We evaluate our method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our method attains significant SDR and PESQ improvements over the raw video-to-speech predictions, and a well-known audio-only method.
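
The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say how the filter is built, so the ratio-style soft mask, the function names soft_mask and apply_mask, and the toy magnitude values below are all assumptions made for illustration. The sketch assumes that a video-to-speech model has already produced a predicted speech magnitude spectrogram with the same shape as the noisy one.

```python
# Hypothetical sketch: use a predicted speech spectrogram as a filter on noisy audio.
# Spectrograms are plain nested lists indexed as [frame][frequency_bin].


def soft_mask(predicted_mag, noisy_mag, eps=1e-8):
    """Per-bin gain in [0, 1]: predicted speech magnitude relative to noisy magnitude.

    Assumption: a ratio mask clipped at 1. Other choices (for example a
    thresholded binary mask) would fit the same description in the abstract.
    """
    return [
        [min(1.0, p / (n + eps)) for p, n in zip(p_row, n_row)]
        for p_row, n_row in zip(predicted_mag, noisy_mag)
    ]


def apply_mask(mask, noisy_mag):
    """Scale every time-frequency bin of the noisy spectrogram by its gain."""
    return [
        [m * n for m, n in zip(m_row, n_row)]
        for m_row, n_row in zip(mask, noisy_mag)
    ]


# Toy values (made up): two frames, two frequency bins.
predicted = [[1.0, 0.0], [0.5, 2.0]]  # stand-in for video-to-speech output
noisy = [[2.0, 3.0], [0.5, 1.0]]      # stand-in for the noisy input magnitudes

mask = soft_mask(predicted, noisy)
enhanced = apply_mask(mask, noisy)
```

Bins where the prediction contains little speech energy are attenuated, while bins where the prediction meets or exceeds the noisy magnitude pass through unchanged. In a full pipeline the enhanced magnitudes would be recombined with the phase of the noisy signal and inverted back to a waveform; that step is omitted here.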

Original language: English
Title of host publication: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3051-3055
Number of pages: 5
ISBN (Print): 9781538646588
State: Published - 10 Sep 2018
Event: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Calgary, Canada
Duration: 15 Apr 2018 to 20 Apr 2018

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2018-April
ISSN (Print): 1520-6149

Conference

Conference: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018
Country/Territory: Canada
City: Calgary
Period: 15/04/18 to 20/04/18

Bibliographical note

Publisher Copyright:
© 2018 IEEE.

Keywords

  • Cocktail party problem
  • Speech separation
  • Speechreading
  • Visual speech processing
