SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 01:34:10
The field of automatic speech recognition (ASR) is now dominated by end-to-end (E2E) models that map speech directly to text. In this talk, we will give an overview of E2E ASR models and introduce recent progress from an industry perspective. We will describe how a masking strategy was applied to the Transformer Transducer to design an E2E model with both high accuracy and low latency. We will discuss technologies that use text-only data both for general model training, through pretraining, and for adaptation to a new domain, through augmentation and factorization. We will also discuss how to build multilingual ASR models that serve all users. Then, we will extend E2E modeling to streaming multi-speaker ASR. Finally, we will close the talk with some new research opportunities to explore.