DISENTANGLED FEATURE LEARNING FOR REAL-TIME NEURAL SPEECH CODING

Xue Jiang (Communication University of China); Xiulian Peng (Microsoft Research Asia); Yuan Zhang (Communication University of China); Yan Lu (Microsoft Research Asia)

07 Jun 2023

Recently, end-to-end neural audio/speech coding has shown great potential to outperform traditional codecs based on signal analysis. This is mostly achieved by following the VQ-VAE paradigm, in which blind features are learned, vector-quantized, and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specifically, speech is represented by a more global speaker identity feature and local content features, learned with disentanglement. Such a compact feature decomposition not only achieves better coding efficiency by exploiting bit allocation among the different features, but also provides the flexibility to edit audio in the embedding space, such as voice conversion in real-time communications. Both subjective and objective results demonstrate the coding efficiency of our approach, and the learned disentangled features achieve any-to-any voice conversion performance comparable to that of modern self-supervised speech representation learning models, with far fewer parameters and lower latency, showing the potential of our neural coding framework.
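
The abstract outlines the core idea: a content branch whose frame-level features are vector-quantized, a speaker branch pooled into a single global embedding, and a decoder that recombines the two; swapping the speaker embedding at decode time yields voice conversion. Below is a minimal, hypothetical PyTorch sketch of that idea. All layer sizes, the codebook size, and the mean-pooling speaker encoder are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of the disentangled codec idea described in the abstract.
# Module shapes, codebook size, and mean pooling are illustrative assumptions,
# not the paper's architecture.
from typing import Optional

import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbor VQ with a straight-through gradient estimator."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, time, dim); snap every frame to its nearest codeword.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        z_q = self.codebook(dist.argmin(dim=-1))
        # Straight-through: the decoder sees z_q, the encoder gets gradients.
        return z + (z_q - z).detach()


class DisentangledCodec(nn.Module):
    """Local content codes (quantized per frame) plus one global speaker
    embedding pooled over time, recombined in the decoder."""

    def __init__(self, feat_dim: int = 80, content_dim: int = 64,
                 speaker_dim: int = 64, num_codes: int = 256):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Linear(feat_dim, content_dim), nn.ReLU(),
            nn.Linear(content_dim, content_dim))
        self.speaker_enc = nn.Sequential(
            nn.Linear(feat_dim, speaker_dim), nn.ReLU(),
            nn.Linear(speaker_dim, speaker_dim))
        self.vq = VectorQuantizer(num_codes, content_dim)
        self.decoder = nn.Sequential(
            nn.Linear(content_dim + speaker_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))

    def forward(self, x: torch.Tensor,
                speaker_ref: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x: (batch, time, feat_dim) acoustic frames.
        content = self.vq(self.content_enc(x))            # local, quantized
        ref = x if speaker_ref is None else speaker_ref
        speaker = self.speaker_enc(ref).mean(dim=1)       # global, one vector
        speaker = speaker.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.decoder(torch.cat([content, speaker], dim=-1))


# Voice conversion in embedding space: decode the source's content codes
# with a target speaker's global embedding instead of the source's own.
codec = DisentangledCodec()
source = torch.randn(1, 100, 80)   # source utterance features
target = torch.randn(1, 120, 80)   # target speaker reference
converted = codec(source, speaker_ref=target)
print(converted.shape)             # torch.Size([1, 100, 80])
```

The key design point this sketch mirrors is the bottleneck asymmetry: content is kept per-frame but quantized (cheap to code, bit allocation can favor it), while speaker identity is collapsed to one vector per utterance, which is what makes swapping it for conversion possible.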
