VQ-CL: Learning disentangled speech representations with contrastive learning and vector quantization
Huaizhen Tang (University of Science and Technology of China); Xulong Zhang (Ping An Technology (Shenzhen) Co., Ltd.); Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd.); Ning Cheng (Ping An Technology (Shenzhen) Co., Ltd.); Jing Xiao (Ping An Insurance (Group) Company of China)
Voice conversion (VC) refers to transforming the voice characteristics of an utterance so that it sounds as if spoken by a different person while preserving its linguistic content. Recently, many studies have focused on disentanglement-based VC, which separates timbre and linguistic content information from the speech signal to achieve conversion. However, it remains challenging to extract phoneme-level features from frame-level hidden representations. This paper proposes VQ-CL, a novel zero-shot voice conversion framework that uses contrastive learning and vector quantization to draw frame-level hidden features closer to phoneme-level linguistic information. Both objective and subjective experimental results show that VQ-CL outperforms previous methods in separating content from voice characteristics, improving the quality of the generated speech.
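To make the core idea concrete, below is a minimal illustrative sketch, not the authors' implementation: it combines a standard VQ-VAE-style quantization step (nearest-codebook lookup with a straight-through estimator and commitment loss) with an InfoNCE-style contrastive term that pulls each frame-level feature toward its assigned code and away from the other codes, which is one plausible reading of "contrastive learning plus vector quantization" in the abstract. All names (`quantize`, `vq_contrastive_loss`) and hyperparameters here are assumptions.

```python
import torch
import torch.nn.functional as F


def quantize(frames, codebook):
    """Assign each frame-level feature to its nearest codebook entry
    (vector quantization), with a straight-through gradient estimator.

    frames:   (batch, time, dim) encoder outputs
    codebook: (num_codes, dim) learned phoneme-like code vectors
    """
    # Pairwise distances between frames and all codes.
    dists = torch.cdist(frames, codebook.unsqueeze(0).expand(frames.size(0), -1, -1))
    indices = dists.argmin(dim=-1)                       # (batch, time)
    quantized = codebook[indices]                        # (batch, time, dim)
    # Straight-through: forward pass uses the quantized codes,
    # backward pass copies gradients to the continuous frames.
    quantized_st = frames + (quantized - frames).detach()
    return quantized_st, quantized, indices


def vq_contrastive_loss(frames, codebook, temperature=0.1, beta=0.25):
    """VQ-VAE codebook/commitment losses plus a contrastive term
    (hypothetical formulation; the paper's exact loss may differ)."""
    _, quantized, indices = quantize(frames, codebook)
    # Move codebook entries toward frames, and frames toward their codes.
    codebook_loss = F.mse_loss(quantized, frames.detach())
    commit_loss = F.mse_loss(frames, quantized.detach())
    # InfoNCE-style term: each frame's own code is the positive,
    # every other code in the codebook is a negative.
    logits = frames @ codebook.t() / temperature         # (batch, time, num_codes)
    contrastive = F.cross_entropy(logits.flatten(0, 1), indices.flatten())
    return contrastive + codebook_loss + beta * commit_loss
```

Under this sketch, the contrastive term sharpens the assignment of frames to discrete codes, which is one way frame-level representations can be encouraged to collapse onto phoneme-level units.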