PHONE-INFORMED REFINEMENT OF SYNTHESIZED MEL SPECTROGRAM FOR DATA AUGMENTATION IN SPEECH RECOGNITION

Sei Ueno, Tatsuya Kawahara

Length: 00:14:47

13 May 2022

Recent end-to-end automatic speech recognition (ASR) models achieve high performance, but they require large amounts of training data. To mitigate a shortage of training data, text-to-speech systems have been used to turn text-only data into paired data for training the ASR model. The widely used procedure first generates a Mel spectrogram from text, converts it into a waveform with a vocoder, and then converts the waveform back into a Mel spectrogram. The vocoder step reduces the mismatch between real and synthesized speech, but it takes a very long time to run. In this work, we propose a phone-informed post-processing network that refines Mel spectrograms without using the vocoder. The proposed network takes as input not only the Mel spectrogram but also the text information of the speech, which enables phone-informed refinement. Experimental evaluations demonstrate that the proposed network achieves lower word error rates (WERs) than the vocoder network in an English domain adaptation task (LibriSpeech to TED-LIUM 2; read speech to spontaneous speech) with much shorter data generation time, and that the use of phone information is critical for the improvement. We also confirm the effect of the proposed model in a Japanese domain adaptation task (CSJ-SPS to CSJ-APS; everyday topics to academic topics).
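
For readers unfamiliar with the metric used in the comparison above, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not code from the paper; the function name word_error_rate is illustrative, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

As a usage example, word_error_rate("the cat sat on the mat", "the cat sit on mat") counts one substitution and one deletion against six reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.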

Tags: speech recognition, transformer, speech synthesis, domain adaptation, fastspeech 2

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
