EFFICIENT STUTTERING EVENT DETECTION USING SIAMESE NETWORKS
Payal Mohapatra (Northwestern University); Bashima Islam (Worcester Polytechnic Institute); MD Tamzeed Islam (Amazon); Ruochen Jiao (Northwestern University); Zhu Qi (Northwestern University)
Speech disfluency research is pivotal as conversational technology
and voice assistants have become commonplace. Stutter
detection is critical in accommodating atypical speakers in current
automatic speech recognition systems. However, the lack
of publicly available labeled and unlabeled datasets is a significant
bottleneck to this research. While many works use
pseudo-dysfluency data with proxy labels and formulate a self-supervised
task, we see merit in using real-world data. We
consolidate publicly available speech disfluency
corpora, with and without labels, and propose DisfluentSiam, a
simple siamese-network-based small-scale pretraining pipeline
that uses task-specific data from multiple domains and has only 10M
trainable parameters. We show that with DisfluentSiam, we
achieve an average 15% boost in performance across five
types of dysfluency event detection compared to direct wav2vec
2.0 embeddings. In particular, with only 4-5 minutes of labeled data
for fine-tuning, DisfluentSiam demonstrates the advantage
of task-specific pretraining with up to 25% higher accuracy.