Dynamic TF-TDNN: Dynamic Time Delay Neural Network based on Temporal-Frequency Attention for Dialect Recognition
Chao Liao (Kuaishou); Jinwen Huang (Kuaishou Technology); Huan Yuan (Kuaishou Technology); Peng Yao (Kuaishou Inc.); Jianchao Tan (Kwai Inc.); Zhang Dawei (Kuaishou Technology); Feng Deng (Kuaishou); Xiaorui Wang (Kwai); Chengru Song (Kuaishou)
SPS
Dialect recognition aims to identify the dialect category of an utterance and is used in many audio applications. Recently, various Time Delay Neural Network (TDNN) based models, such as D-TDNN, DMC-TDNN, and ECAPA-TDNN, have been proposed for dialect recognition. However, most of them apply temporal attention only in the final statistical pooling layer of the TDNN, which ignores the importance of simultaneously capturing key frequency and temporal information in utterances under different receptive fields. In contrast, we introduce a hybrid attention mechanism in both the temporal and frequency domains, called the TF-attention module, which adaptively attends to the most important frames and to the important frame-level information under different receptive fields. Moreover, we are the first to introduce a dynamic architecture mechanism to dialect recognition, which dynamically reduces the computational cost and the number of model parameters. We evaluate the proposed dynamic TF-TDNN on the AP20-OLR-dialect task of the OLR challenge and achieve state-of-the-art (SOTA) performance with fewer model parameters.