CSENET: COMPLEX SQUEEZE-AND-EXCITATION NETWORK FOR SPEECH DEPRESSION LEVEL PREDICTION

Cunhang Fan, Zhao Lv, Shengbing Pei, Mingyue Niu


SPS


Length: 00:07:59

10 May 2022

Automatic speech depression level prediction (SDLP) is a challenging problem in affective computing. Many studies have achieved good performance on SDLP. However, most of them use input speech features derived from the amplitude spectrogram, which discards the phase spectrogram and may therefore lose information relevant to depression. To make full use of the speech information, this paper proposes a complex squeeze-and-excitation network (CSENet) for SDLP. The complex spectrogram, which contains both the amplitude and the phase spectrogram, is used as the input speech feature. In addition, to obtain discriminative features, a squeeze-and-excitation residual network is employed to extract deep speech features. Finally, attentive temporal pooling is used to dynamically select the more important information through an attention mechanism. Experimental results on the AVEC 2013 and AVEC 2014 datasets demonstrate the effectiveness of the proposed method. In terms of the mean absolute error (MAE) evaluation metric on AVEC 2013, the proposed method achieves state-of-the-art performance.
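The three components named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the frame length and hop size, the weight matrices `w1` and `w2`, and the scoring vector `query` are all hypothetical, and the residual connections and convolutional layers of a full squeeze-and-excitation residual network are omitted. It only shows, under those assumptions, how a complex spectrogram retains phase, how squeeze-and-excitation rescales channels, and how attention weights combine frames over time.

```python
import cmath
import math


def complex_spectrogram(signal, frame_len=8, hop=4):
    """Naive short-time DFT (no window). Returns frames x bins of (real, imag).

    Keeping both parts preserves amplitude and phase; an amplitude
    spectrogram would keep only sqrt(real**2 + imag**2).
    frame_len and hop are illustrative values, not taken from the paper.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        bins = []
        for k in range(frame_len // 2 + 1):
            z = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(frame))
            bins.append((z.real, z.imag))
        frames.append(bins)
    return frames


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def squeeze_excite(channels, w1, w2):
    """Channel gating only. channels: C flat feature maps.

    w1 (hidden x C) and w2 (C x hidden) stand in for learned weights.
    """
    # Squeeze: global average of each channel.
    squeezed = [sum(ch) / len(ch) for ch in channels]
    # Excitation: small bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Rescale every channel by its gate.
    return [[g * v for v in ch] for g, ch in zip(gates, channels)]


def attentive_temporal_pooling(frames, query):
    """Softmax-weighted average over time.

    frames: T feature vectors of dimension D.
    query: D-dimensional scoring vector (a stand-in for learned attention).
    """
    scores = [sum(q * v for q, v in zip(query, f)) for f in frames]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights
```

In this sketch, the real and imaginary parts of each bin could be treated as two input channels for `squeeze_excite`; that layout is an assumption made for illustration. With an all-zero `query`, `attentive_temporal_pooling` reduces to plain average pooling, which makes clear that the attention scores are what let some frames count more than others.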

Tags:

attentive temporal pooling

senet

speech depression level prediction

complex spectrogram