Phone-to-audio alignment without text: A Semi-supervised Approach

Jian Zhu, David Jurgens, Cong Zhang

Length: 00:07:43

12 May 2022

The task of phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment. The first, Wav2Vec2-FS, is a semi-supervised model that learns phone-to-audio alignment directly through contrastive learning and a forward sum loss, and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The second, Wav2Vec2-FC, is a frame classification model trained on forced-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that both proposed methods, even when transcriptions are not available, produce results close to those of existing forced alignment tools. Our work presents a fully automated neural pipeline for phone-to-audio alignment. Code and pretrained models are available at https://github.com/lingjzhu/charsiu.
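
To illustrate what text-independent segmentation from a frame classification model involves, the minimal sketch below collapses a sequence of per-frame phone labels into timed phone segments. It is not the authors' implementation: the function name `frames_to_segments`, the example labels, and the 20 ms frame duration are assumptions made for illustration only.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_dur=0.02):
    """Collapse per-frame phone labels into (phone, start_sec, end_sec) segments.

    frame_dur is an assumed frame duration in seconds, not a value
    taken from the paper.
    """
    segments = []
    start = 0  # index of the first frame of the current run
    for phone, run in groupby(frame_labels):
        n = len(list(run))  # number of consecutive frames with this label
        segments.append(
            (phone, round(start * frame_dur, 4), round((start + n) * frame_dur, 4))
        )
        start += n
    return segments


# Hypothetical frame-level predictions for a short utterance.
frame_labels = ["sil", "sil", "k", "k", "k", "ae", "ae", "t", "sil"]
segments = frames_to_segments(frame_labels)
for phone, begin, end in segments:
    print(f"{phone}\t{begin:.2f}\t{end:.2f}")
```

Each run of identical labels becomes one segment, so a phone boundary falls wherever the predicted label changes between adjacent frames.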

Tags: automatic speech recognition, phone segmentation, deep learning, forced alignment

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
