Abstract
Speechreading is a notoriously difficult task for humans. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. The proposed CNN generates sound features for each frame based on its neighboring frames. Waveforms are then synthesized from the learned speech features to produce intelligible speech. We show that by leveraging the automatic feature learning capabilities of a CNN, we obtain state-of-the-art word intelligibility on the GRID dataset, and show promising results for learning out-of-vocabulary (OOV) words.
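As a rough illustration of the mapping the abstract describes, the sketch below is not the authors' exact architecture; the frame size, window length, feature dimension, layer choices, and the use of PyTorch are all assumptions. It shows the general shape of the idea: a CNN takes a short window of video frames centered on one frame and predicts a vector of sound features for that frame.

```python
# A minimal sketch, assuming hypothetical sizes: 9-frame grayscale windows
# of 64x64 mouth-region crops mapped to 128-dimensional sound features.
import torch
import torch.nn as nn

class FramesToSpeechFeatures(nn.Module):
    """CNN that predicts acoustic features for the center frame of a window."""

    def __init__(self, n_frames=9, feat_dim=128):
        super().__init__()
        # The temporal window is stacked along the channel axis: (B, n_frames, H, W).
        self.encoder = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> (B, 128, 1, 1)
        )
        self.head = nn.Linear(128, feat_dim)

    def forward(self, x):
        # x: (B, n_frames, H, W) -> (B, feat_dim) sound features per window.
        return self.head(self.encoder(x).flatten(1))

model = FramesToSpeechFeatures()
window = torch.randn(4, 9, 64, 64)  # batch of 4 nine-frame windows
features = model(window)            # (4, 128) predicted speech features
print(features.shape)
```

Concatenating the per-frame predictions over time yields a feature sequence from which a waveform can be synthesized; if the features are magnitude-spectrogram columns, an off-the-shelf inversion such as `librosa.griffinlim` is one option, though the paper's specific synthesis step may differ.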
Original language | English |
---|---|
Title of host publication | 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 5095-5099 |
Number of pages | 5 |
ISBN (Electronic) | 9781509041176 |
DOIs | |
State | Published - 16 Jun 2017 |
Event | 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - New Orleans, United States. Duration: 5 Mar 2017 → 9 Mar 2017 |
Publication series
Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
---|---|
ISSN (Print) | 1520-6149 |
Conference
Conference | 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 |
---|---|
Country/Territory | United States |
City | New Orleans |
Period | 5/03/17 → 9/03/17 |
Bibliographical note
Publisher Copyright: © 2017 IEEE.
Keywords
- articulatory-to-acoustic mapping
- neural networks
- speech intelligibility
- speechreading
- visual speech processing