ABSTRACT

In this paper, we present a multichannel speech enhancement method based on wakeup word mask estimation using a Deep Neural Network (DNN). The wakeup word is regarded as an important cue for identifying the target speaker. We use a DNN to estimate a wakeup word mask and a noise mask, and apply them to separate the mixed wakeup word signal into the target speaker’s speech and background noise. A Convolutional Recurrent Neural Network (CRNN) is used to exploit both the short-term and long-term time-frequency dependencies of sequences such as speech signals. A Generalized Eigenvector (GEV) beamformer then uses the masks to estimate a spatial filter that enhances the target speaker’s subsequent speech command and suppresses undesired noise. Experimental results show that the proposed method is more robust to noise, improving both the Signal-to-Noise Ratio (SNR) and speech recognition accuracy.

Keywords: Multichannel Speech Enhancement, Wakeup Word, Mask Estimation, Beamforming, Deep Neural Network (DNN)
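To make the mask-to-beamformer step concrete, the following is a minimal NumPy sketch of how time-frequency masks can drive a GEV beamformer. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names (`masked_psd`, `gev_weights`, `apply_beamformer`), the array layout (frequency, time, channel), and the diagonal-loading constant are all hypothetical, and the masks are assumed to be supplied by some estimator such as the CRNN described above.

```python
import numpy as np

def masked_psd(stft, mask, eps=1e-10):
    """Mask-weighted spatial covariance (PSD) matrix per frequency bin.

    stft: complex array of shape (F, T, C) (assumed layout: freq, time, channel)
    mask: real array of shape (F, T) with values in [0, 1]
    returns: complex array of shape (F, C, C)
    """
    weighted = stft * mask[..., None]
    # Phi[f, c, d] = sum_t mask[f, t] * x[f, t, c] * conj(x[f, t, d])
    psd = np.einsum("ftc,ftd->fcd", weighted, stft.conj())
    return psd / (mask.sum(axis=1)[:, None, None] + eps)

def gev_weights(psd_speech, psd_noise, diag_load=1e-6):
    """Principal generalized eigenvector of (psd_speech, psd_noise) per bin.

    diag_load is an assumed regularization constant, not a value from the paper.
    """
    n_freq, n_ch, _ = psd_speech.shape
    weights = np.zeros((n_freq, n_ch), dtype=complex)
    for f in range(n_freq):
        # Light diagonal loading keeps the noise matrix well conditioned.
        load = diag_load * np.trace(psd_noise[f]).real / n_ch
        noise = psd_noise[f] + load * np.eye(n_ch)
        # Solve Phi_speech w = lambda * Phi_noise w via inv(Phi_noise) Phi_speech.
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(noise, psd_speech[f]))
        weights[f] = eigvecs[:, np.argmax(eigvals.real)]
    return weights

def apply_beamformer(weights, stft):
    """Beamformer output y[f, t] = w[f]^H x[f, t]."""
    return np.einsum("fc,ftc->ft", weights.conj(), stft)
```

In the setting described in the abstract, the two PSD matrices would be computed from the wakeup word segment with the estimated wakeup word and noise masks, and the resulting weights would then be applied to the STFT of the speech command that follows. Note that a GEV filter maximizes the output SNR but is only defined up to a complex scale per frequency bin, so a post-filter or normalization step is commonly added in practice.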