Multi-QuartzNet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion

Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao

Length: 0:14:42

19 Jan 2021

In this paper, we propose an end-to-end speech recognition network based on Nvidia's earlier QuartzNet model. To improve on its performance, we design three components: (1) a Multi-Resolution Convolution Module, which replaces the original 1D time-channel separable convolution with multi-stream convolutions, where each stream uses a unique dilation stride in its convolutional operations; (2) a Channel-Wise Attention Module, which calculates the attention weight of each convolutional stream by spatial channel-wise pooling; and (3) a Multi-Layer Feature Fusion Module, which reweights each convolutional block using global multi-layer feature maps. Our experiments demonstrate that the Multi-QuartzNet model achieves a character error rate (CER) of 6.77% on the AISHELL-1 dataset, which outperforms the original QuartzNet and is close to the state-of-the-art result.
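
The three components can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation: it works on a single channel, has no learned parameters, and assumes that the attention weights come from mean pooling over time followed by a softmax, which the abstract does not specify. The names `dilated_conv1d`, `attention_fuse`, and `multi_resolution_block`, and the dilation values `(1, 2, 4)`, are hypothetical and chosen only for illustration.

```python
import math


def dilated_conv1d(signal, kernel, dilation):
    """1D convolution over time (cross-correlation, as in deep-learning
    frameworks) with zero 'same' padding and the given dilation."""
    half = (len(kernel) - 1) // 2  # assumes an odd kernel length
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + (k - half) * dilation
            if 0 <= idx < len(signal):  # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(feature_maps):
    """Assumed attention: pool each map to one score (mean over time),
    softmax the scores, and return (weights, weighted sum of the maps)."""
    scores = [sum(m) / len(m) for m in feature_maps]
    weights = softmax(scores)
    fused = [
        sum(w * m[t] for w, m in zip(weights, feature_maps))
        for t in range(len(feature_maps[0]))
    ]
    return weights, fused


def multi_resolution_block(signal, kernel, dilations=(1, 2, 4)):
    """Run one convolution stream per dilation, then reweight the streams."""
    streams = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return attention_fuse(streams)


# Toy input: 8 time steps of a single feature channel.
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [0.25, 0.5, 0.25]

# Components (1) and (2): multi-stream convolution with per-stream attention.
stream_weights, block1 = multi_resolution_block(signal, kernel)
_, block2 = multi_resolution_block(block1, kernel)

# Component (3), sketched: reweight the outputs of several blocks.
layer_weights, fused = attention_fuse([block1, block2])
```

Here `stream_weights` holds one weight per dilation and `layer_weights` one weight per block; both sum to 1 because of the softmax. A real model would apply this per channel with trainable kernels and a learned attention layer.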

Tags:

sps conference

slt 2021