VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement
Chenye Cui (Zhejiang University); Zhou Zhao (Zhejiang University); Yi Ren (Bytedance); Jinglin Liu (Zhejiang University); Rongjie Huang (Zhejiang University); Feiyang Chen (Huawei); Zhefeng Wang (Huawei Cloud); Baoxing Huai (Huawei Cloud); Fei Wu (Zhejiang University)
Video-to-sound generation aims to generate realistic and natural sound given a video input.
However, previous video-to-sound generation methods can only produce a random or average timbre, offering no control over the timbre of the generated sound, so users cannot always obtain the timbre they want. In this paper, we propose the task of generating sound with a specific timbre given a silent video input and a reference audio sample.
To solve this task, we first use three encoders to disentangle each target audio clip into temporal, acoustic, and background information, and then use a decoder to reconstruct the audio from these disentangled representations. To improve the quality and temporal alignment of the generated results, we also adopt a mel discriminator and a temporal discriminator for adversarial training.
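As a rough illustration of this three-encoder/one-decoder layout, consider the following minimal PyTorch sketch. All module names, layer choices, and dimensions here are our own assumptions for exposition, not the paper's implementation, and the adversarial mel/temporal discriminator losses are only indicated in a comment.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small 1D conv stack mapping a mel-spectrogram (B, n_mels, T) to (B, d, T)."""
    def __init__(self, n_mels=80, d=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, d, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=5, padding=2),
        )
    def forward(self, mel):
        return self.net(mel)

class DisentangleReconstructSketch(nn.Module):
    """Hypothetical module: three parallel encoders plus a fusing decoder."""
    def __init__(self, n_mels=80, d=128):
        super().__init__()
        self.temporal_enc = Encoder(n_mels, d)    # when sound events occur
        self.acoustic_enc = Encoder(n_mels, d)    # timbre of the sound source
        self.background_enc = Encoder(n_mels, d)  # ambient/background component
        self.decoder = nn.Sequential(              # fuse and reconstruct the mel
            nn.Conv1d(3 * d, d, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d, n_mels, kernel_size=5, padding=2),
        )
    def forward(self, mel):
        z = torch.cat([self.temporal_enc(mel),
                       self.acoustic_enc(mel),
                       self.background_enc(mel)], dim=1)
        return self.decoder(z)  # reconstructed mel-spectrogram

# Reconstruction objective on a toy batch; the paper's mel and temporal
# discriminator losses would be added on top of this term.
model = DisentangleReconstructSketch()
mel = torch.randn(4, 80, 200)
loss = nn.functional.l1_loss(model(mel), mel)

At inference time, the temporal representation would come from the silent video and the acoustic representation from the reference audio, which is what makes the timbre controllable.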
Our experimental results on the VAS dataset demonstrate that our method can generate high-quality audio samples that are well synchronized with the events in the video and exhibit high timbre similarity to the reference audio.