Speech acoustic modeling from raw multichannel waveforms

Yedid Hoshen, Ron J. Weiss, Kevin W. Wilson

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

172 Scopus citations

Abstract

Standard deep neural network-based acoustic models for automatic speech recognition (ASR) rely on hand-engineered input features, typically log-mel filterbank magnitudes. In this paper, we describe a convolutional neural network - deep neural network (CNN-DNN) acoustic model which takes raw multichannel waveforms as input, i.e. without any preceding feature extraction, and learns a similar feature representation through supervised training. By operating directly in the time domain, the network is able to take advantage of the signal's fine time structure that is discarded when computing filterbank magnitude features. This structure is especially useful when analyzing multichannel inputs, where timing differences between input channels can be used to localize a signal in space. The first convolutional layer of the proposed model naturally learns a filterbank that is selective in both frequency and direction of arrival, i.e. a bank of bandpass beamformers with an auditory-like frequency scale. When trained on data corrupted with noise coming from different spatial locations, the network learns to filter the noise out by steering nulls in the directions corresponding to the noise sources. Experiments on a simulated multichannel dataset show that the proposed acoustic model outperforms a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.
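The "bank of beamformers" reading of the first convolutional layer can be illustrated with a minimal filter-and-sum sketch. This is not the paper's model: the filters here are set by hand rather than learned, and the two-microphone setup, the integer 2-sample delay, and the names `filter_and_sum`, `s`, `x`, `h`, and `y` are all assumptions made for illustration.

```python
import numpy as np

def filter_and_sum(x, h):
    """Multichannel time-domain convolution layer (no bias, no nonlinearity).

    x: waveforms, shape (channels, samples)
    h: filters, shape (filters, channels, taps)
    Returns an array of shape (filters, samples - taps + 1).
    """
    n_filters, n_channels, n_taps = h.shape
    assert x.shape[0] == n_channels
    y = np.zeros((n_filters, x.shape[1] - n_taps + 1))
    for p in range(n_filters):
        for c in range(n_channels):
            # 'valid' cross-correlation: y[n] += sum_k x[c, n + k] * h[p, c, k]
            y[p] += np.correlate(x[c], h[p, c], mode="valid")
    return y

# Assumed toy scene: one source reaches microphone 1 two samples
# later than microphone 0.
rng = np.random.default_rng(0)
n_samples, delay = 64, 2
s = rng.standard_normal(n_samples)
x = np.stack([s, np.concatenate([np.zeros(delay), s[:-delay]])])

# Two hand-set filters with delay + 1 taps per channel.
h = np.zeros((2, 2, delay + 1))
h[0, 0, 0], h[0, 1, delay] = 1.0, 1.0    # aligns the channels: steers toward the source
h[1, 0, 0], h[1, 1, delay] = 1.0, -1.0   # aligns then subtracts: null toward the source

y = filter_and_sum(x, h)
```

In this toy case the first output equals twice the source signal (the channels add coherently), while the second is zero (the source falls in a null). A trained layer would have to discover per-channel taps with a similar effect from data, and could combine such spatial selectivity with a bandpass response in the same filter.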

Original language: English
Title of host publication: 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4624-4628
Number of pages: 5
ISBN (Electronic): 9781467369978
State: Published - 4 Aug 2015
Event: 40th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Brisbane, Australia
Duration: 19 Apr 2014 → 24 Apr 2014

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2015-August
ISSN (Print): 1520-6149

Conference

Conference: 40th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015
Country/Territory: Australia
City: Brisbane
Period: 19/04/14 → 24/04/14

Bibliographical note

Publisher Copyright:
© 2015 IEEE.

Keywords

  • Automatic speech recognition
  • acoustic modeling
  • beamforming
  • convolutional neural networks
