Abstract
Speechreading is the task of inferring phonetic information from visually observed articulatory facial movements, and is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person. We train our model on speakers from the GRID and TCD-TIMIT datasets, and evaluate the quality and intelligibility of reconstructed speech using common objective measurements. We show that speech predictions from the proposed model attain scores which indicate significantly improved quality over existing models. In addition, we show promising results towards reconstructing speech from an unconstrained dictionary.
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 455-462 |
| Number of pages | 8 |
| ISBN (Electronic) | 9781538610343 |
| State | Published - 19 Jan 2018 |
| Event | 16th IEEE International Conference on Computer Vision Workshops, ICCVW 2017 - Venice, Italy. Duration: 22 Oct 2017 → 29 Oct 2017 |
Publication series
| Name | Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017 |
|---|---|
| Volume | 2018-January |
Conference
| Conference | 16th IEEE International Conference on Computer Vision Workshops, ICCVW 2017 |
|---|---|
| Country/Territory | Italy |
| City | Venice |
| Period | 22/10/17 → 29/10/17 |
Bibliographical note
Publisher Copyright: © 2017 IEEE.
Fingerprint
Dive into the research topics of 'Improved Speech Reconstruction from Silent Video'. Together they form a unique fingerprint.