token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text
Xianghu Yue (National University of Singapore); Junyi Ao (The Chinese University of Hong Kong (Shenzhen)); Xiaoxue Gao (National University of Singapore); Haizhou Li (The Chinese University of Hong Kong (Shenzhen))
Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The open question is whether joint speech-text pre-training can be performed on unpaired speech and text.
In this paper, we take the idea of self-supervised pre-training one step further and propose token2vec, a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Specifically, we introduce two modality-specific tokenizers for speech and text. Based on these tokenizers, we convert speech and text sequences into discrete speech/text token sequences consisting of similar language units, thus mitigating the domain and length mismatch problems caused by the distinct characteristics of speech and text.
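To make the tokenization step concrete, the following is a minimal sketch, assuming a clustering-based speech tokenizer that maps feature frames to discrete units and collapses adjacent repeats (shortening the speech sequence) and a simple character-level text tokenizer as a stand-in for phone-like text units; all names and parameters here are illustrative, not the exact tokenizers used by token2vec.

# Minimal sketch of modality-specific tokenization into discrete token sequences.
# The clustering-based speech tokenizer and the character-level text tokenizer
# below are illustrative assumptions, not token2vec's exact tokenizers.
import numpy as np

def tokenize_speech(features: np.ndarray, codebook: np.ndarray) -> list[int]:
    """Map each speech feature frame to its nearest codebook entry (cluster ID),
    then collapse consecutive repeats to shorten the sequence."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (frames, clusters)
    frame_ids = dists.argmin(axis=1).tolist()
    tokens = [frame_ids[0]]
    for t in frame_ids[1:]:
        if t != tokens[-1]:          # de-duplicate adjacent frames
            tokens.append(t)
    return tokens

def tokenize_text(text: str, vocab: dict[str, int]) -> list[int]:
    """Map characters to integer IDs as a stand-in for phone-like text units."""
    return [vocab[c] for c in text.lower() if c in vocab]

# Toy usage with random data.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 speech "units", 4-dim features
speech = rng.normal(size=(50, 4))    # 50 frames of speech features
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
print(tokenize_speech(speech, codebook))
print(tokenize_text("token to vec", vocab))

Collapsing repeated speech units and using compact text units brings the two token sequences closer in length and granularity, which is the intent of the mismatch mitigation described above.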
Finally, we feed the discrete speech and text tokens into a modality-agnostic Transformer encoder and pre-train it with token-level masked language modeling (tMLM).
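The sketch below illustrates a token-level masked language modeling objective over a shared encoder, assuming a small PyTorch Transformer encoder; the vocabulary size, masking ratio, and module names are assumptions, not the paper's exact configuration.

# Schematic sketch of token-level masked language modeling (tMLM) over a shared
# Transformer encoder; dimensions, masking ratio, and names are assumptions.
import torch
import torch.nn as nn

VOCAB = 512          # combined speech + text token vocabulary (assumed size)
MASK_ID = VOCAB      # extra ID reserved for the [MASK] token
D_MODEL, MASK_P = 256, 0.15

class SharedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, D_MODEL)   # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB)           # predict original tokens

    def forward(self, tokens):                          # tokens: (B, T) int64
        return self.head(self.encoder(self.embed(tokens)))

def tmlm_loss(model, tokens):
    """Mask a random subset of tokens and predict them from the unmasked context."""
    mask = torch.rand_like(tokens, dtype=torch.float) < MASK_P
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)                           # (B, T, VOCAB)
    targets = tokens.masked_fill(~mask, -100)           # ignore unmasked positions
    return nn.functional.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)

# Toy usage: the same encoder and loss apply to speech-token and text-token batches.
model = SharedEncoder()
speech_tokens = torch.randint(0, VOCAB, (2, 100))
text_tokens = torch.randint(0, VOCAB, (2, 40))
loss = tmlm_loss(model, speech_tokens) + tmlm_loss(model, text_tokens)
loss.backward()

Because both modalities arrive as discrete tokens, a single encoder and a single masking objective can be shared across speech and text, which is the sense in which the encoder is modality-agnostic.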
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to a 17.7% relative WER reduction. The token2vec model is also validated on a non-ASR task, i.e., spoken intent classification, and shows good transferability.