
Toward a Multimodal Approach for Disfluency Detection and Categorization

Amrit Romana (University of Michigan); Kazuhito Koishida (Microsoft)

06 Jun 2023

Speech disfluencies, such as filled pauses, repetitions, and revisions, disrupt the typical flow of speech. Disfluency detection and categorization have gained traction as a research area because modeling disfluent events has been shown to help downstream tasks. However, most work on disfluency detection and categorization has focused on language-based approaches that process manually transcribed text. While these methods achieve high accuracy, their reliance on manual transcription limits their scalability and practicality. In this paper, we evaluate the impact of using automatic speech recognition (ASR) transcripts for disfluency detection and categorization. Additionally, we explore skipping transcription altogether with an acoustic approach. We assess the strengths and weaknesses of each modality and introduce a model fusion approach that combines the two. We find that multimodal disfluency detection and categorization outperforms either individual modality, and that the gain in performance is especially significant when the language-based model processes ASR transcripts.
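
To illustrate the kind of multimodal fusion the abstract describes, the sketch below shows a simple late-fusion setup in PyTorch: a text branch over ASR-transcript tokens and an acoustic branch over frame-level features, with their logits averaged. This is a minimal illustration, not the authors' architecture; the layer sizes, pooling, fusion by logit averaging, and the four assumed classes (filled pause, repetition, revision, fluent) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4  # assumed label set: filled pause, repetition, revision, fluent


class TextBranch(nn.Module):
    """Language-based branch over (possibly errorful) ASR transcript tokens."""

    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, NUM_CLASSES)

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))
        return self.head(h.mean(dim=1))      # mean-pooled utterance logits


class AcousticBranch(nn.Module):
    """Acoustic branch over frame-level features, no transcription needed."""

    def __init__(self, feat_dim=80, dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, dim, batch_first=True)
        self.head = nn.Linear(dim, NUM_CLASSES)

    def forward(self, frames):               # (batch, num_frames, feat_dim)
        h, _ = self.rnn(frames)
        return self.head(h.mean(dim=1))


class LateFusion(nn.Module):
    """Averages the two branches' logits; one simple fusion strategy."""

    def __init__(self):
        super().__init__()
        self.text = TextBranch()
        self.audio = AcousticBranch()

    def forward(self, token_ids, frames):
        return 0.5 * (self.text(token_ids) + self.audio(frames))


# Example forward pass with dummy ASR token ids and log-mel-like frames.
model = LateFusion()
logits = model(torch.randint(0, 10000, (2, 20)), torch.randn(2, 300, 80))
print(logits.shape)  # torch.Size([2, 4])
```

Late fusion is only one option; the same two branches could instead be combined by concatenating their pooled representations before a shared classification head.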
