Visual speech enhancement

Aviv Gabbay, Asaph Shamir, Shmuel Peleg

Research output: Contribution to journal › Conference article › peer-review

85 Scopus citations

Abstract

When video is shot in a noisy environment, the voice of a speaker seen in the video can be enhanced using the visible mouth movements, reducing background noise. While most existing methods use audio-only inputs, improved performance is obtained with our visual speech enhancement, which is based on an audio-visual neural network. We include in the training data videos to which we added the voice of the target speaker as background noise. Since the audio input alone is not sufficient to separate the voice of a speaker from his own voice, the trained model better exploits the visual input and generalizes well to different noise types. The proposed model outperforms prior audio-visual methods on two public lipreading datasets. It is also the first to be demonstrated on a dataset not designed for lipreading, such as the weekly addresses of Barack Obama.
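The training-data idea described in the abstract, adding another utterance by the same speaker as background interference so that the audio stream alone cannot disambiguate the two voices, can be illustrated with a small sketch. The Python snippet below is a minimal, hypothetical illustration (the function name `mix_same_speaker` and the fixed-SNR mixing scheme are assumptions, not the authors' code); it assumes both waveforms are NumPy arrays at the same sampling rate.

```python
import numpy as np

def mix_same_speaker(clean: np.ndarray, interference: np.ndarray, snr_db: float = 0.0) -> np.ndarray:
    """Mix a clean utterance with another utterance by the *same* speaker.

    Hypothetical sketch of the training-data construction in the abstract:
    because the interfering voice shares the target speaker's vocal
    characteristics, audio alone cannot separate the two, pushing the
    network to rely on the visual (mouth-movement) input.
    """
    # Tile or trim the interference to match the length of the clean signal.
    if len(interference) < len(clean):
        reps = int(np.ceil(len(clean) / len(interference)))
        interference = np.tile(interference, reps)
    interference = interference[: len(clean)]

    # Scale the interference so the mixture has the requested SNR (in dB).
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(interference ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * interference
```

In such a setup, the clean utterance serves as the training target for the mixture; a full pipeline would additionally pair each mixture with the corresponding mouth-region video frames as the visual input.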

Original language: English
Pages (from-to): 1170-1174
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOIs
State: Published - 2018
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2 Sep 2018 - 6 Sep 2018

Bibliographical note

Publisher Copyright:
© 2018 International Speech Communication Association. All rights reserved.

Keywords

  • Speech enhancement
  • Visual speech processing
