Role of Lexical Boundary Information in Chunk-Level Segmentation for Speech Emotion Recognition
Wei-Cheng Lin (The University of Texas at Dallas); Carlos Busso (The University of Texas at Dallas)
SPS
Chunk-level speech emotion recognition (SER) is a common modeling scheme that often achieves better recognition performance than sentence-level formulations. A key open question is the role of lexical boundary information when splitting a sentence into small chunks: is there any benefit in using precise lexical boundaries (e.g., word-level alignments) to segment the speech? This study analyzes the role of lexical boundary information by exploring alternative segmentation strategies for chunk-level SER. We compare six chunk-level segmentation strategies that rely either on word-level alignments or on traditional time-based segmentation, varying the number and the duration of the chunks. We conduct extensive experiments to evaluate these segmentation approaches using multiple corpora and multiple acoustic feature sets. The results show that word-level timing boundaries make only a minor contribution: centering the chunks around words does not lead to significant performance gains. Instead, the critical factor for effectively segmenting a sentence into data chunks is to set the number of chunks according to the number of spoken words in the sentence.
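The contrast between time-based and word-aligned segmentation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the six strategies, so the function names (`uniform_chunks`, `word_centered_chunks`, `word_count_chunks`), the even spacing of windows, and the clamping of windows to the utterance boundaries are all assumptions made for illustration.

```python
from typing import List, Tuple

Chunk = Tuple[float, float]  # (start, end) in seconds


def uniform_chunks(duration: float, num_chunks: int, chunk_len: float) -> List[Chunk]:
    """Time-based split (assumed scheme): num_chunks windows of chunk_len
    seconds, spread evenly over [0, duration]; windows may overlap."""
    if duration <= 0 or chunk_len <= 0 or num_chunks < 1:
        raise ValueError("duration, chunk_len and num_chunks must be positive")
    chunk_len = min(chunk_len, duration)  # a window cannot exceed the utterance
    if num_chunks == 1:
        start = (duration - chunk_len) / 2  # single window placed in the middle
        return [(start, start + chunk_len)]
    step = (duration - chunk_len) / (num_chunks - 1)
    return [(i * step, i * step + chunk_len) for i in range(num_chunks)]


def word_centered_chunks(
    word_bounds: List[Chunk], duration: float, chunk_len: float
) -> List[Chunk]:
    """Word-aligned split (assumed scheme): one window per word, centered on
    the word midpoint and clamped so it stays inside the utterance."""
    chunk_len = min(chunk_len, duration)
    chunks = []
    for w_start, w_end in word_bounds:
        center = (w_start + w_end) / 2
        start = max(0.0, min(center - chunk_len / 2, duration - chunk_len))
        chunks.append((start, start + chunk_len))
    return chunks


def word_count_chunks(num_words: int, duration: float, chunk_len: float) -> List[Chunk]:
    """Hybrid split (assumed scheme): the number of chunks follows the word
    count, but placement is purely time-based and ignores word boundaries."""
    return uniform_chunks(duration, max(1, num_words), chunk_len)
```

Under these assumptions, `word_count_chunks` needs only the word count (for example, from a transcript) rather than forced alignments, which corresponds to the factor the abstract identifies as critical, whereas `word_centered_chunks` requires the word-level timing information that the results suggest contributes little.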