AN INVESTIGATION OF STREAMING NON-AUTOREGRESSIVE SEQUENCE-TO-SEQUENCE VOICE CONVERSION

Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 00:14:45

09 May 2022

Recent advances in sequence-to-sequence (S2S) models have improved the quality of voice conversion (VC), but it requires the entire sequence to perform inference, which prevents using it in real-time applications. To address this issue, this paper extends the non-autoregressive (NAR) S2S-VC model to enable us to perform streaming VC. We introduce streamable architectures such as causal convolution and self-attention with causal masking for the FastSpeech2-based NAR-S2S-VC model. The streamable architecture also tries to convert durations, which are kept as is in conventional real-time VC methods. To further improve the performance of the streaming VC model, we utilize an instant knowledge distillation with a dual-mode architecture, which performs non-causal and causal inference by sharing the network parameters. Through the experimental evaluation with Japanese parallel corpus, we investigate the impact on performance caused by the streamable architecture. The experimental results reveal that the use of future context frames increases latency, but it improves the conversion quality and that the difference in the speaking rate affects the performance of streaming inference.

Tags:

voice conversion

non-autoregressive

sequence-to-sequence

streaming

AN INVESTIGATION OF STREAMING NON-AUTOREGRESSIVE SEQUENCE-TO-SEQUENCE VOICE CONVERSION

Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

Short Course Bundle: ICASSP 2022 COURSE 5: Speech Technology for Health: From Technical Foundations to Applications (Parts 1-3)

UNSUPERVISED DOMAIN ADAPTATION WITH IMBALANCED CHARACTER DISTRIBUTION FOR SCENE TEXT RECOGNITION

USTED: IMPROVING ASR WITH A UNIFIED SPEECH AND TEXT ENCODER-DECODER

Join the IEEE Signal Processing Society