VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement

Chenye Cui (Zhejiang University); Zhou Zhao (Zhejiang University); Yi Ren (Bytedance); Jinglin Liu (Zhejiang University); Rongjie Huang (Zhejiang University); Feiyang Chen (Huawei); Zhefeng Wang (Huawei Cloud); Baoxing Huai (Huawei Cloud); Fei Wu (Zhejiang University)

08 Jun 2023

Video-to-sound generation aims to produce realistic and natural sound from a video input. However, previous video-to-sound methods generate only a random or average timbre, with no control over the timbre of the output, so users sometimes cannot obtain the timbre they desire. In this paper, we propose the task of generating sound with a specific timbre given a silent video and a reference audio sample. To solve this task, we first use three encoders to disentangle each target audio into temporal, acoustic, and background information, then use a decoder to reconstruct the audio from these disentangled representations. To improve quality and temporal alignment, we also adopt a mel discriminator and a temporal discriminator for adversarial training. Experimental results on the VAS dataset demonstrate that our method generates high-quality audio samples that are well synchronized with events in the video and exhibit high timbre similarity to the reference audio.
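The abstract describes a disentangle-then-reconstruct pipeline: three encoders factor the target audio into temporal, acoustic (timbre), and background representations, and a decoder rebuilds the audio from them. The paper's abstract gives no implementation details, so the following is only a minimal shape-level sketch of that data flow; all module names, latent sizes, and the choice of simple linear projections are assumptions, not the authors' architecture.

```python
# Hypothetical sketch of the disentanglement pipeline described in the abstract.
# All dimensions and the linear-projection "encoders" are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Stand-in encoder: a single nonlinear projection of a mel-spectrogram."""
    return np.tanh(x @ w)

T, N_MELS = 100, 80   # frames x mel bins (assumed)
D = 32                # latent size per disentangled factor (assumed)

mel = rng.standard_normal((T, N_MELS))  # target audio's mel-spectrogram

# Three encoders disentangle the target audio into three factors.
W_temporal = rng.standard_normal((N_MELS, D)) * 0.1    # event timing information
W_acoustic = rng.standard_normal((N_MELS, D)) * 0.1    # timbre information
W_background = rng.standard_normal((N_MELS, D)) * 0.1  # background information

z_temporal = encode(mel, W_temporal)                               # per-frame
z_acoustic = encode(mel, W_acoustic).mean(axis=0, keepdims=True)   # time-invariant
z_background = encode(mel, W_background).mean(axis=0, keepdims=True)

# The decoder reconstructs the mel-spectrogram from the concatenated factors,
# broadcasting the time-invariant timbre/background codes over all frames.
z = np.concatenate([z_temporal,
                    np.repeat(z_acoustic, T, axis=0),
                    np.repeat(z_background, T, axis=0)], axis=1)
W_dec = rng.standard_normal((3 * D, N_MELS)) * 0.1
mel_hat = z @ W_dec

print(mel_hat.shape)  # reconstruction matches the original mel shape: (100, 80)
```

At generation time, the timbre code would come from the reference audio and the temporal code from the silent video, so swapping `z_acoustic` changes the timbre while preserving event timing; the mel and temporal discriminators mentioned in the abstract would then supervise the reconstruction adversarially.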
