
Toward a Multimodal Approach for Disfluency Detection and Categorization

Amrit Romana (University of Michigan); Kazuhito Koishida (Microsoft)

06 Jun 2023

Speech disfluencies, such as filled pauses, repetitions, and revisions, disrupt the typical flow of speech. Disfluency detection and categorization have gained traction as a research area because modeling disfluent events has been shown to help downstream tasks. However, most work on disfluency detection and categorization has focused on language-based approaches that process manually transcribed text. While these methods achieve high accuracy, their reliance on manual transcription limits their scalability and practicality. In this paper, we evaluate the impact of using automatic speech recognition (ASR) transcripts for disfluency detection and categorization. Additionally, we explore skipping transcription altogether with an acoustic approach. We assess the strengths and weaknesses of each modality and introduce a model fusion approach that combines the two. We find that multimodal disfluency detection and categorization outperforms either individual modality, and that the gain in performance is especially significant when the language-based model processes ASR transcripts.
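
To illustrate the kind of multimodal fusion the abstract describes, the sketch below shows a simple late-fusion setup in PyTorch: a text branch over ASR-transcript tokens and an acoustic branch over frame-level features, with their logits averaged. This is a minimal illustration, not the authors' architecture; the layer sizes, pooling, fusion by logit averaging, and the four assumed classes (filled pause, repetition, revision, fluent) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 4  # assumed label set: filled pause, repetition, revision, fluent


class TextBranch(nn.Module):
    """Language-based branch over (possibly errorful) ASR transcript tokens."""

    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, NUM_CLASSES)

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))
        return self.head(h.mean(dim=1))      # mean-pooled utterance logits


class AcousticBranch(nn.Module):
    """Acoustic branch over frame-level features, no transcription needed."""

    def __init__(self, feat_dim=80, dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, dim, batch_first=True)
        self.head = nn.Linear(dim, NUM_CLASSES)

    def forward(self, frames):               # (batch, num_frames, feat_dim)
        h, _ = self.rnn(frames)
        return self.head(h.mean(dim=1))


class LateFusion(nn.Module):
    """Averages the two branches' logits; one simple fusion strategy."""

    def __init__(self):
        super().__init__()
        self.text = TextBranch()
        self.audio = AcousticBranch()

    def forward(self, token_ids, frames):
        return 0.5 * (self.text(token_ids) + self.audio(frames))


# Example forward pass with dummy ASR token ids and log-mel-like frames.
model = LateFusion()
logits = model(torch.randint(0, 10000, (2, 20)), torch.randn(2, 300, 80))
print(logits.shape)  # torch.Size([2, 4])
```

Late fusion is only one option; the same two branches could instead be combined by concatenating their pooled representations before a shared classification head.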
