Optimal Transport in Diffusion Modeling for Conversion Tasks in Audio Domain
Vadim Popov (Huawei Noah's Ark Lab); Amantur Amatov (Huawei); Mikhail Kudinov (Huawei Noah's Ark Lab); Vladimir Gogoryan (Huawei Noah's Ark Lab); Tasnima Sadekova (Huawei Noah's Ark Lab); Ivan Vovk (Huawei Noah's Ark Lab)
SPS
Diffusion models have recently become a popular generative modeling framework in various domains because of their high-quality sampling capabilities. It has also been hypothesized that optimally trained diffusion models, combined with specific differential equation solvers, provide a solution to the optimal transport problem between the data distribution and the prior distribution. In this paper, we empirically show that applying the optimal transport point of view to diffusion modeling helps choose a good noise sample from which reverse diffusion starts generating. We consider two audio-related tasks: voice conversion and timbre transfer. For voice conversion, we improve upon a recent state-of-the-art model and demonstrate that optimal transport helps preserve the prosody of the source utterances significantly better than the vanilla diffusion-based model does. For timbre transfer, we propose a novel diffusion model capable of many-to-many timbre transfer that performs on par with common algorithms in terms of overall music quality.
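The optimal transport hypothesis can be checked exactly in a toy setting. The sketch below is not the authors' model. It assumes a 1D Gaussian data distribution N(m, s^2) and a variance-preserving forward process dX = -X/2 dt + dW. It integrates the deterministic probability-flow ODE with RK4 to map a data point to its noise sample. In 1D, any increasing map that pushes one distribution onto another is the quadratic-cost optimal transport map, so the ODE encoding should coincide with the monotone Gaussian-to-Gaussian map. The helper names (`marginal`, `pf_ode_drift`, `encode`, `ot_map`) and the parameters `T` and `n_steps` are hypothetical.

```python
import math


def marginal(t, m, s):
    """Mean and std of p_t for dX = -X/2 dt + dW started from N(m, s^2)."""
    mu = m * math.exp(-t / 2)
    var = s**2 * math.exp(-t) + (1.0 - math.exp(-t))
    return mu, math.sqrt(var)


def pf_ode_drift(x, t, m, s):
    """Probability-flow ODE drift f(x, t) - g^2/2 * score with f = -x/2, g = 1."""
    mu, sigma = marginal(t, m, s)
    score = -(x - mu) / sigma**2  # exact score of the Gaussian marginal p_t
    return -0.5 * x - 0.5 * score


def encode(x0, m, s, T=5.0, n_steps=2000):
    """Deterministically map a data point to a noise sample (t = 0 -> T) with RK4."""
    h = T / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        k1 = pf_ode_drift(x, t, m, s)
        k2 = pf_ode_drift(x + h / 2 * k1, t + h / 2, m, s)
        k3 = pf_ode_drift(x + h / 2 * k2, t + h / 2, m, s)
        k4 = pf_ode_drift(x + h * k3, t + h, m, s)
        x += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x


def ot_map(x0, m, s, T=5.0):
    """Monotone (hence optimal in 1D) map from N(m, s^2) to p_T = N(mu_T, sigma_T^2)."""
    mu_T, sigma_T = marginal(T, m, s)
    return mu_T + sigma_T * (x0 - m) / s


if __name__ == "__main__":
    for x0 in (1.0, 2.0, 3.5):
        print(x0, encode(x0, 2.0, 0.5), ot_map(x0, 2.0, 0.5))
```

For Gaussian data the two columns agree up to integration error. For real audio data this agreement is the hypothesis cited above rather than a guarantee, which is why the paper treats it empirically.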