VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement
Chenye Cui (Zhejiang University); Zhou Zhao (Zhejiang University); Yi Ren (Bytedance); Jinglin Liu (Zhejiang University); Rongjie Huang (Zhejiang University); Feiyang Chen (Huawei); Zhefeng Wang (Huawei Cloud); Baoxing Huai (Huawei Cloud); Fei Wu (Zhejiang University)
Video-to-sound generation aims to generate realistic and natural sound given a video input.
However, previous video-to-sound generation methods can only produce a random or average timbre, offering no control over the timbre of the generated sound, so users cannot always obtain the timbre they want. In this paper, we propose the task of generating sound with a specific timbre given a silent video input and a reference audio sample.
To solve this task, we first use three encoders to disentangle each target audio clip into temporal, acoustic, and background information, and then use a decoder to reconstruct the audio from these disentangled representations. To improve the quality and temporal alignment of the generated results, we also adopt a mel discriminator and a temporal discriminator for adversarial training.
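As a rough illustration of this three-encoder/one-decoder layout, consider the following minimal PyTorch sketch. All module names, layer choices, and dimensions here are our own assumptions for exposition, not the paper's implementation, and the adversarial mel/temporal discriminator losses are only indicated in a comment.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small 1D conv stack mapping a mel-spectrogram (B, n_mels, T) to (B, d, T)."""
    def __init__(self, n_mels=80, d=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, d, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=5, padding=2),
        )
    def forward(self, mel):
        return self.net(mel)

class DisentangleReconstructSketch(nn.Module):
    """Hypothetical module: three parallel encoders plus a fusing decoder."""
    def __init__(self, n_mels=80, d=128):
        super().__init__()
        self.temporal_enc = Encoder(n_mels, d)    # when sound events occur
        self.acoustic_enc = Encoder(n_mels, d)    # timbre of the sound source
        self.background_enc = Encoder(n_mels, d)  # ambient/background component
        self.decoder = nn.Sequential(              # fuse and reconstruct the mel
            nn.Conv1d(3 * d, d, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d, n_mels, kernel_size=5, padding=2),
        )
    def forward(self, mel):
        z = torch.cat([self.temporal_enc(mel),
                       self.acoustic_enc(mel),
                       self.background_enc(mel)], dim=1)
        return self.decoder(z)  # reconstructed mel-spectrogram

# Reconstruction objective on a toy batch; the paper's mel and temporal
# discriminator losses would be added on top of this term.
model = DisentangleReconstructSketch()
mel = torch.randn(4, 80, 200)
loss = nn.functional.l1_loss(model(mel), mel)

At inference time, the temporal representation would come from the silent video and the acoustic representation from the reference audio, which is what makes the timbre controllable.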
Our experimental results on the VAS dataset demonstrate that our method can generate high-quality audio samples that are well synchronized with the events in the video and exhibit high timbre similarity to the reference audio.