Abstract
Speechreading is the task of inferring phonetic information from visually observed articulatory facial movements, and is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person. We train our model on speakers from the GRID and TCD-TIMIT datasets, and evaluate the quality and intelligibility of reconstructed speech using common objective measurements. We show that speech predictions from the proposed model attain scores which indicate significantly improved quality over existing models. In addition, we show promising results towards reconstructing speech from an unconstrained dictionary.
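The paper's pipeline maps silent video frames of a speaker to acoustic features from which a waveform is synthesized. As a rough illustration of that idea (not the authors' architecture, which is a much deeper end-to-end CNN), the sketch below uses a single hypothetical convolution-plus-linear-readout stage to turn each grayscale mouth-region frame into one vector of acoustic features, e.g. one spectrogram column per video frame. All shapes and names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(img, kernel):
    """Naive 'valid' 2-D convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def frames_to_features(frames, kernel, weights):
    """Map each video frame to a vector of acoustic features.

    A real speech-reconstruction CNN stacks many conv/pool layers and is
    trained end-to-end; this single conv + ReLU + linear readout only
    demonstrates the frame-to-feature mapping.
    """
    feats = []
    for frame in frames:
        fmap = np.maximum(conv2d_valid(frame, kernel), 0.0)  # conv + ReLU
        feats.append(fmap.ravel() @ weights)                 # linear readout
    return np.stack(feats)

# Toy dimensions (hypothetical; real inputs are larger mouth crops)
T, H, W = 5, 16, 16          # 5 frames of 16x16 pixels
K, F = 3, 32                 # 3x3 kernel, 32 acoustic features per frame
frames = rng.standard_normal((T, H, W))
kernel = rng.standard_normal((K, K)) * 0.1
weights = rng.standard_normal(((H - K + 1) * (W - K + 1), F)) * 0.01

features = frames_to_features(frames, kernel, weights)
print(features.shape)  # one feature vector per input frame -> (5, 32)
```

A vocoder or inversion step (not shown) would then turn the predicted feature sequence back into an audible waveform.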
Original language | English |
---|---|
Title of host publication | Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 455-462 |
Number of pages | 8 |
ISBN (Electronic) | 9781538610343 |
DOIs | |
State | Published - 1 Jul 2017 |
Event | 16th IEEE International Conference on Computer Vision Workshops, ICCVW 2017 - Venice, Italy. Duration: 22 Oct 2017 → 29 Oct 2017 |
Publication series
Name | Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017 |
---|---|
Volume | 2018-January |
Conference
Conference | 16th IEEE International Conference on Computer Vision Workshops, ICCVW 2017 |
---|---|
Country/Territory | Italy |
City | Venice |
Period | 22/10/17 → 29/10/17 |
Bibliographical note
Publisher Copyright: © 2017 IEEE.