Multi-Quartznet: Multi-Resolution Convolution For Speech Recognition With Multi-Layer Feature Fusion
Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao
SPS
Length: 0:14:42
In this paper, we propose an end-to-end speech recognition network based on Nvidia's QuartzNet model. To improve model performance, we design three components: (1) a Multi-Resolution Convolution Module, which replaces the original 1D time-channel separable convolution with multi-stream convolutions, where each stream uses a unique dilated stride in its convolutional operations; (2) a Channel-Wise Attention Module, which calculates the attention weight of each convolutional stream by spatial channel-wise pooling; and (3) a Multi-Layer Feature Fusion Module, which reweights each convolutional block using global multi-layer feature maps. Our experiments demonstrate that the Multi-QuartzNet model achieves a CER of 6.77% on the AISHELL-1 data set, which outperforms the original QuartzNet and is close to the state-of-the-art result.
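To make the first two components concrete, here is a minimal single-channel sketch in plain Python. It is not the authors' implementation. The abstract does not give the exact formulation, so the following are assumptions: the "dilated stride" is read as a per-stream dilation rate, the attention weight of each stream is a softmax over its time-averaged output, and the streams are fused by a weighted sum. The function names `dilated_conv1d` and `multi_resolution_block` are hypothetical.

```python
import math


def dilated_conv1d(x, kernel, dilation):
    """Zero-padded, length-preserving 1D convolution (cross-correlation form)
    of one channel. Assumes an odd-length kernel; taps are `dilation` apart."""
    half = (len(kernel) // 2) * dilation
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - half + j * dilation
            if 0 <= idx < len(x):  # positions outside the signal count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_resolution_block(x, kernel, dilations):
    """Run one stream per dilation rate, then fuse the streams with attention.

    Assumption: attention scores come from global average pooling over time,
    and fusion is a weighted sum; the paper may define both differently.
    """
    streams = [dilated_conv1d(x, kernel, d) for d in dilations]
    pooled = [sum(s) / len(s) for s in streams]  # one scalar per stream
    weights = softmax(pooled)                    # attention weight per stream
    fused = [
        sum(w * s[t] for w, s in zip(weights, streams))
        for t in range(len(x))
    ]
    return fused, weights


if __name__ == "__main__":
    signal = [0.0, 0.0, 1.0, 0.0, 0.0]
    fused, weights = multi_resolution_block(signal, [1.0, 2.0, 3.0], [1, 2])
    print(fused, weights)
```

A larger dilation widens the receptive field of a stream without adding parameters, which is why streams with different rates can be seen as operating at different temporal resolutions. A real model would apply this per channel inside a time-channel separable block and would learn the kernels.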