Toward a Multimodal Approach for Disfluency Detection and Categorization
Amrit Romana (University of Michigan); Kazuhito Koishida (Microsoft)
Speech disfluencies, such as filled pauses, repetitions, and revisions, disrupt the typical flow of speech. Disfluency detection and categorization has gained traction as a research area because modeling disfluent events has been shown to help downstream tasks. However, most work on disfluency detection and categorization has focused on language-based approaches that process manually transcribed text. While these methods achieve high accuracy, their reliance on manual transcription limits their scalability and practicality. In this paper, we evaluate the impact of using automatic speech recognition (ASR) transcripts for disfluency detection and categorization. We also explore skipping transcription altogether with an acoustic approach. We assess the strengths and weaknesses of each modality and introduce a model fusion approach that combines the two. We find that multimodal disfluency detection and categorization outperforms either individual modality, and that the improvement is especially significant when the language-based model processes ASR transcripts.
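The abstract does not specify the fusion architecture, but one common pattern for combining a language model with an acoustic model is late feature fusion: concatenate per-token text embeddings with token-aligned acoustic embeddings and classify each token. The sketch below illustrates that general idea only; the class name, feature dimensions, and label set are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn

class FusionDisfluencyClassifier(nn.Module):
    """Illustrative sketch (not the authors' model): fuse per-token text
    features with token-aligned acoustic features, then assign each token
    a disfluency label (e.g., fluent, filled pause, repetition, revision)."""

    def __init__(self, text_dim=768, audio_dim=512, hidden_dim=256, num_classes=4):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feats, audio_feats):
        # text_feats:  (batch, seq_len, text_dim), e.g. token embeddings
        #              from a pretrained language model over an ASR transcript
        # audio_feats: (batch, seq_len, audio_dim), e.g. acoustic frame
        #              embeddings pooled to token boundaries via alignment
        fused = torch.cat([text_feats, audio_feats], dim=-1)
        return self.fusion(fused)  # per-token class logits

# Hypothetical usage: batch of 2 utterances, 10 aligned tokens each.
logits = FusionDisfluencyClassifier()(torch.randn(2, 10, 768), torch.randn(2, 10, 512))
```

Concatenation-based fusion is only one option; the same interface could instead average the two models' per-token predictions or use a gating mechanism, and the paper's actual fusion method may differ.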