STYLE MODELING FOR MULTI-SPEAKER ARTICULATION-TO-SPEECH

Miseul Kim (Yonsei University); Zhenyu Piao (Yonsei University); Jihyun Lee (Yonsei University); Hong-Goo Kang (Yonsei University)

06 Jun 2023

In this paper, we propose a neural articulation-to-speech (ATS) framework that synthesizes high-quality speech from articulatory signals in a multi-speaker setting. Most conventional ATS approaches focus only on modeling the contextual information of speech from a single speaker's articulatory features. To explicitly represent each speaker's speaking style as well as the contextual information, our proposed model estimates style embeddings guided by essential speech style attributes such as pitch and energy. The model adopts convolutional layers and transformer-based attention layers to fully utilize both the local and global information of articulatory signals measured by electromagnetic articulography (EMA). Our model significantly improves the quality of synthesized speech over the baseline in both objective and subjective measurements on the Haskins dataset.
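The abstract describes an encoder that combines convolutional layers (for local articulatory dynamics) with transformer attention layers (for global context), plus a style embedding guided by pitch and energy. Below is a minimal sketch of such an architecture in PyTorch; it is not the authors' implementation, and all module names, the EMA feature dimension, and the utterance-level pooling for the style embedding are illustrative assumptions.

```python
# Hypothetical sketch: Conv1d layers capture local EMA dynamics, transformer
# attention layers capture global context, and a style head produces a
# speaker-style embedding that could be supervised with pitch/energy targets.
import torch
import torch.nn as nn


class ATSEncoder(nn.Module):
    def __init__(self, ema_dim=12, hidden=256, n_heads=4, n_layers=4):
        super().__init__()
        # Local feature extraction from raw EMA trajectories.
        self.conv = nn.Sequential(
            nn.Conv1d(ema_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Global context via self-attention.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True
        )
        self.attn = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Style embedding head; in training, its output could be guided by
        # pitch/energy statistics of the target speech, as the paper suggests.
        self.style_head = nn.Linear(hidden, hidden)

    def forward(self, ema):  # ema: (batch, time, ema_dim)
        x = self.conv(ema.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        x = self.attn(x)                                    # (B, T, hidden)
        style = self.style_head(x.mean(dim=1))              # utterance-level (B, hidden)
        return x, style


# Usage: encode a batch of EMA windows (dimensions are hypothetical).
enc = ATSEncoder()
content, style = enc(torch.randn(2, 200, 12))
print(content.shape, style.shape)  # torch.Size([2, 200, 256]) torch.Size([2, 256])
```

The convolutional front end operates on channel-first tensors, hence the transposes around it; mean pooling over time is one simple way to obtain a single utterance-level style vector from frame-level features.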
