
Cross-Training: A Semi-Supervised Training Scheme for Speech Recognition

Soheil Khorram, Anshuman Tripathi, Jaeyoung Kim, Han Lu, Qian Zhang, Rohit Prabhavalkar, Hasim Sak (Google)

06 Jun 2023

Semi-supervised training can be performed by jointly optimizing supervised and unsupervised losses. In many settings, the two losses are inconsistent, and this inconsistency makes training unstable. As a solution, we propose cross-training: instead of training one network with two losses, we train two separate networks, each with its own loss, and tie their parameters by minimizing an additional L2 loss between the parameters. This L2 loss acts as a knowledge bridge between the networks: it forces them to stay similar, so each can learn from the other. This paper introduces the cross-training scheme to develop a stable contrastive siamese (c-siam) network. Our experiments on LibriSpeech and Google’s Voice-Search/YouTube datasets show that (1) cross-training provides a 20% relative WER improvement over SOTA systems on LibriSpeech; (2) cross-training stabilizes c-siam training and significantly outperforms SOTA systems on small supervised datasets; (3) cross-training is effective for cascaded encoders, unlike the original c-siam, which shows weak convergence characteristics.
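The sketch below illustrates the cross-training objective described in the abstract, assuming a toy PyTorch setup: the two small networks, the loss-function arguments, and the tie_weight hyperparameter are placeholders for illustration, not the paper's actual conformer/RNN-T encoders or settings.

import torch
import torch.nn as nn

def l2_parameter_tie(net_a: nn.Module, net_b: nn.Module) -> torch.Tensor:
    # L2 distance between corresponding parameters of the two networks;
    # this is the "knowledge bridge" term tying the networks together.
    return sum(
        ((p_a - p_b) ** 2).sum()
        for p_a, p_b in zip(net_a.parameters(), net_b.parameters())
    )

# Two separate networks with identical architecture (toy stand-ins).
sup_net = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 32))
unsup_net = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 32))

optimizer = torch.optim.Adam(
    list(sup_net.parameters()) + list(unsup_net.parameters()), lr=1e-4
)
tie_weight = 0.1  # assumed hyperparameter, not taken from the paper

def training_step(labeled_batch, unlabeled_batch,
                  supervised_loss_fn, unsupervised_loss_fn):
    # Each network is optimized with its own loss ...
    loss_sup = supervised_loss_fn(sup_net, labeled_batch)
    loss_unsup = unsupervised_loss_fn(unsup_net, unlabeled_batch)
    # ... plus the shared L2 penalty between their parameters.
    loss = loss_sup + loss_unsup + tie_weight * l2_parameter_tie(sup_net, unsup_net)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()

The key design choice is that neither network ever sees the other network's loss directly; they only interact through the parameter-space L2 term, which is what avoids optimizing one set of weights against two inconsistent objectives.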
