MONAURAL SPEECH ENHANCEMENT SAMPLES FROM DUAL-STAGED CONTEXT AGGREGATION NETWORK

In monaural speech enhancement, an end-to-end deep neural network converts a noisy speech signal directly to clean speech in the time domain, without time-frequency transformation or mask estimation. However, aggregating contextual information at the high sampling rates of time-domain signals, while keeping model complexity affordable, remains challenging. In this paper, we propose a densely connected convolutional and recurrent network, a hybrid architecture of densely connected convolutional networks (DenseNet) and gated recurrent units (GRU), to enable dual-level temporal context aggregation. With a careful design, the proposed model benefits from the temporal context in the signals at an affordable model complexity, thanks to the dense connectivity between layers. Experimental results show consistent and noticeable improvements on various metrics over competing convolutional networks with large receptive fields.
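
To make the hybrid design concrete, below is a minimal PyTorch sketch of the idea the abstract describes: a densely connected 1-D convolutional block aggregates local temporal context over the raw waveform, and a GRU running over the resulting feature sequence aggregates long-range context. All layer sizes, the growth rate, and the exact topology here are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Stack of 1-D convolutions where each layer receives the
    concatenation of all earlier feature maps. Dense connectivity keeps
    every layer narrow, which is how the parameter count stays low."""
    def __init__(self, in_ch, growth, num_layers, kernel_size=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv1d(ch, growth, kernel_size, padding=kernel_size // 2),
                nn.PReLU(),
            ))
            ch += growth
        self.out_ch = ch

    def forward(self, x):                       # x: (batch, in_ch, time)
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)          # (batch, out_ch, time)

class DenseGRUEnhancer(nn.Module):
    """Dual-level context aggregation (sketch): dense conv block for
    local context, GRU over time for long-range context, then a 1x1
    convolution projects back to a single waveform channel."""
    def __init__(self, growth=16, num_layers=4, hidden=64):
        super().__init__()
        self.dense = DenseBlock(1, growth, num_layers)
        self.gru = nn.GRU(self.dense.out_ch, hidden, batch_first=True)
        self.proj = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, noisy):                   # noisy: (batch, 1, time)
        h = self.dense(noisy)
        h, _ = self.gru(h.transpose(1, 2))      # (batch, time, hidden)
        return self.proj(h.transpose(1, 2))     # enhanced waveform

model = DenseGRUEnhancer()
enhanced = model(torch.randn(2, 1, 16000))      # one second at 16 kHz
```

The design choice the sketch highlights is the division of labor: the convolutional stage only needs a modest receptive field because the recurrent stage handles context beyond it, which is why the model can avoid the very deep or heavily dilated stacks that purely convolutional baselines rely on.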

[Audio samples: five noisy test utterances, each presented at -5 dB, 0 dB, and +5 dB SNR. For every utterance, the page compares the unprocessed input, the outputs of Gated ResNet, DenseNet, Dilated DenseNet, DenseNet+GRU, and our proposed model, and the clean reference speech.]