LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

Xun Gong (Shanghai Jiao Tong University); Yu Wu (Microsoft Research Asia); Jinyu Li (Microsoft); Shujie Liu (Microsoft Research Asia); Rui Zhao (Microsoft); Xie Chen (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University)

06 Jun 2023

Traditional automatic speech recognition (ASR) systems usually focus on individual utterances and do not consider long-form speech, whose history carries useful information and which is the more practical setting in real scenarios. In our preliminary experiments, simply attending to a longer transcription history in a vanilla neural transducer model yields little gain, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, using the pre-trained contextual encoder RoBERTa to further boost performance. Moreover, we propose the LongFNT architecture, which extends the original speech input with long-form speech and achieves the best performance. The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora, with 19% and 12% relative word error rate (WER) reduction, respectively.
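To make the reported metric concrete, the following is a minimal sketch of how word error rate and a relative WER reduction are commonly computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, the sketch assumes whitespace tokenization with no text normalization, and the numbers in the usage lines are invented for illustration and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are split on whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (not taken from the paper).
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
example_reduction = relative_wer_reduction(10.0, 8.1)
```

Under this definition, a relative reduction of 19% means the new system's WER is 81% of the baseline's WER, not that the WER dropped by 19 absolute points.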

Tags: language modeling

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

