Improved Speech Reconstruction from Silent Video

Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

79 Scopus citations

Abstract

Speechreading is the task of inferring phonetic information from visually observed articulatory facial movements, and is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person. We train our model on speakers from the GRID and TCD-TIMIT datasets, and evaluate the quality and intelligibility of reconstructed speech using common objective measurements. We show that speech predictions from the proposed model attain scores which indicate significantly improved quality over existing models. In addition, we show promising results towards reconstructing speech from an unconstrained dictionary.
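
The abstract describes the overall approach only at a high level: a CNN that maps silent video frames of a speaker to an acoustic speech signal. As a loose illustration of that idea only (not the authors' actual architecture, whose details are in the paper itself), the following PyTorch sketch shows the general shape of such a model: a convolutional encoder over a short window of face-crop frames that regresses a vector of acoustic features for the window's center frame. The class name, layer sizes, 9-frame window, and 128-dimensional feature target are all illustrative assumptions.

```python
# Hypothetical sketch of a video-to-speech CNN; not the authors' model.
import torch
import torch.nn as nn

class VideoToSpeechCNN(nn.Module):
    def __init__(self, in_frames=9, acoustic_dim=128):
        super().__init__()
        # Treat the K grayscale frames of the input window as K input channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (N, 128, 1, 1)
        )
        # Regress acoustic features for the window's center frame.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, acoustic_dim),
        )

    def forward(self, frames):  # frames: (N, in_frames, H, W)
        return self.head(self.encoder(frames))

# Training would minimize a regression loss between predicted features and
# features extracted from the video's original audio track:
model = VideoToSpeechCNN()
dummy = torch.randn(4, 9, 128, 128)   # batch of 4 windows of 9 frames each
pred = model(dummy)                   # (4, 128) predicted acoustic features
loss = nn.functional.mse_loss(pred, torch.randn_like(pred))
```

In a full pipeline, the predicted acoustic features would be converted back to a waveform (for example via a vocoder), and the "common objective measurements" the abstract mentions are metrics such as STOI for intelligibility and PESQ for quality, computed against the reference recording.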

Original language: English
Title of host publication: Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 455-462
Number of pages: 8
ISBN (Electronic): 9781538610343
DOIs
State: Published - 1 Jul 2017
Event: 16th IEEE International Conference on Computer Vision Workshops, ICCVW 2017 - Venice, Italy
Duration: 22 Oct 2017 → 29 Oct 2017

Publication series

Name: Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017
Volume: 2018-January

Conference

Conference: 16th IEEE International Conference on Computer Vision Workshops, ICCVW 2017
Country/Territory: Italy
City: Venice
Period: 22/10/17 → 29/10/17

Bibliographical note

Publisher Copyright:
© 2017 IEEE.
