token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text
Xianghu Yue (National University of Singapore); Junyi Ao (The Chinese University of Hong Kong (Shenzhen)); Xiaoxue Gao (National University of Singapore); Haizhou Li (The Chinese University of Hong Kong (Shenzhen))
Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The open question is whether joint speech-text pre-training can be performed on unpaired speech and text.
In this paper, we take the idea of self-supervised pre-training one step further and propose token2vec, a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Specifically, we introduce two modality-specific tokenizers for speech and text. Based on these tokenizers, we convert speech and text sequences into discrete speech/text token sequences consisting of similar language units, thus mitigating the domain and length mismatch problems caused by the distinct characteristics of speech and text.
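To make the tokenization step concrete, the following is a minimal sketch, assuming a clustering-based speech tokenizer that maps feature frames to discrete units and collapses adjacent repeats (shortening the speech sequence) and a simple character-level text tokenizer as a stand-in for phone-like text units; all names and parameters here are illustrative, not the exact tokenizers used by token2vec.

# Minimal sketch of modality-specific tokenization into discrete token sequences.
# The clustering-based speech tokenizer and the character-level text tokenizer
# below are illustrative assumptions, not token2vec's exact tokenizers.
import numpy as np

def tokenize_speech(features: np.ndarray, codebook: np.ndarray) -> list[int]:
    """Map each speech feature frame to its nearest codebook entry (cluster ID),
    then collapse consecutive repeats to shorten the sequence."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (frames, clusters)
    frame_ids = dists.argmin(axis=1).tolist()
    tokens = [frame_ids[0]]
    for t in frame_ids[1:]:
        if t != tokens[-1]:          # de-duplicate adjacent frames
            tokens.append(t)
    return tokens

def tokenize_text(text: str, vocab: dict[str, int]) -> list[int]:
    """Map characters to integer IDs as a stand-in for phone-like text units."""
    return [vocab[c] for c in text.lower() if c in vocab]

# Toy usage with random data.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 speech "units", 4-dim features
speech = rng.normal(size=(50, 4))    # 50 frames of speech features
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
print(tokenize_speech(speech, codebook))
print(tokenize_text("token to vec", vocab))

Collapsing repeated speech units and using compact text units brings the two token sequences closer in length and granularity, which is the intent of the mismatch mitigation described above.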
Finally, we feed the discrete speech and text tokens into a modality-agnostic Transformer encoder and pre-train it with token-level masked language modeling (tMLM).
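The sketch below illustrates a token-level masked language modeling objective over a shared encoder, assuming a small PyTorch Transformer encoder; the vocabulary size, masking ratio, and module names are assumptions, not the paper's exact configuration.

# Schematic sketch of token-level masked language modeling (tMLM) over a shared
# Transformer encoder; dimensions, masking ratio, and names are assumptions.
import torch
import torch.nn as nn

VOCAB = 512          # combined speech + text token vocabulary (assumed size)
MASK_ID = VOCAB      # extra ID reserved for the [MASK] token
D_MODEL, MASK_P = 256, 0.15

class SharedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, D_MODEL)   # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB)           # predict original tokens

    def forward(self, tokens):                          # tokens: (B, T) int64
        return self.head(self.encoder(self.embed(tokens)))

def tmlm_loss(model, tokens):
    """Mask a random subset of tokens and predict them from the unmasked context."""
    mask = torch.rand_like(tokens, dtype=torch.float) < MASK_P
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)                           # (B, T, VOCAB)
    targets = tokens.masked_fill(~mask, -100)           # ignore unmasked positions
    return nn.functional.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)

# Toy usage: the same encoder and loss apply to speech-token and text-token batches.
model = SharedEncoder()
speech_tokens = torch.randint(0, VOCAB, (2, 100))
text_tokens = torch.randint(0, VOCAB, (2, 40))
loss = tmlm_loss(model, speech_tokens) + tmlm_loss(model, text_tokens)
loss.backward()

Because both modalities arrive as discrete tokens, a single encoder and a single masking objective can be shared across speech and text, which is the sense in which the encoder is modality-agnostic.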
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to a 17.7% relative WER reduction. The token2vec model is also validated on a non-ASR task, i.e., spoken intent classification, and shows good transferability.