Transformer-Based Acoustic Modeling For Hybrid Speech Recognition
Abdelrahman Mohamed, Duc Le, Chunxi Liu, Yongqiang Wang, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Alex Xiao, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, Michael L. Seltzer
We propose and evaluate transformer-based acoustic models for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss that enables training deep transformers. We also present a preliminary study of using limited right context in transformers, which allows them to be used in streaming applications. We demonstrate that transformer-based acoustic models can outperform very strong LSTM-based acoustic models: on the widely used LibriSpeech benchmark, our transformer-based acoustic model outperforms a previous best acoustic model by 18.8% to 26.4% when the standard $n$-gram language model (LM) is used; state-of-the-art results on the LibriSpeech benchmark were achieved using our transformer-based acoustic model with a neural network LM for re-scoring. Our findings are also confirmed on a large-scale internal dataset.
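The limited right context idea mentioned in the abstract can be illustrated with an attention mask that lets each acoustic frame attend to the full past but only a bounded number of future frames. The sketch below is a minimal, hypothetical PyTorch illustration of such a mask; the function name, the layer sizes, and the choice of two right-context frames are assumptions for demonstration and are not taken from the paper.

```python
import torch

def limited_right_context_mask(seq_len: int, right_context: int) -> torch.Tensor:
    """Boolean attention mask of shape (seq_len, seq_len).

    Entry [i, j] is True (i.e. attention disallowed) when key frame j lies
    more than `right_context` frames in the future of query frame i.
    """
    idx = torch.arange(seq_len)
    return idx.unsqueeze(0) > (idx.unsqueeze(1) + right_context)

# Toy usage: 6 frames of 8-dim acoustic embeddings, 2 frames of right context.
mask = limited_right_context_mask(seq_len=6, right_context=2)
attn = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 6, 8)          # (batch, time, feature)
out, _ = attn(x, x, x, attn_mask=mask)
```

With `right_context=0` the mask reduces to a purely causal (streaming) attention pattern, while larger values trade added latency for more future context, which is the trade-off the preliminary study refers to.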