CN-CVS: A Mandarin Audio-Visual Dataset for Large Vocabulary Continuous Visual to Speech Synthesis

Chen Chen (Tsinghua University); Dong Wang (Tsinghua University); Thomas Fang Zheng (CSLT, Tsinghua University)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
07 Jun 2023

Research on Video to Speech Synthesis (VTS) has surged recently, and the focus is gradually shifting from small-vocabulary, short-phrase VTS to large-vocabulary continuous VTS (LVC-VTS). A large-scale dataset with sufficient speakers and utterances is a prerequisite for such research, and any such dataset is necessarily language dependent. In this paper, we introduce CN-CVS, a large-scale Mandarin continuous visual-speech dataset, to support LVC-VTS research. The dataset contains about 200k utterances from more than 2500 individuals, amounting to more than 300 hours of visual-speech data. We built a state-of-the-art VTS model with the new dataset and conducted preliminary studies. Our results show that models that achieve good performance on small-vocabulary tasks may perform very poorly on CN-CVS, indicating that continuous VTS is indeed a challenging task and that the main challenge comes from the unconstrained vocabulary. The dataset and baseline code can be downloaded for free from http://cncvs.cslt.org.
