nVOC-22: A low cost Mel Spectrogram vocoder for mobile devices
Rakesh Iyer (Google Inc)
SPS
A new neural network architecture is proposed for converting Mel spectrograms into
an audio signal. The architecture is designed from the ground up to run on a mobile
device: it is fully convolutional and non-autoregressive, and it relies on operators
that parallelize easily on mobile CPUs and GPUs.
It introduces a lightweight upsampling block that combines a nearest neighbor resize
with a separable convolution, which provides fast upsampling with minimal
checkerboarding artifacts. The model is trained as a GAN and demonstrates stable
training behavior. A method for evaluating the performance characteristics of neural
vocoders on mobile devices is also described. The model is shown to run up to 20x
faster than realtime on a current generation mobile CPU and up to 65x
faster than realtime on a current generation mobile GPU, while being neutral or
better in quality when evaluated against a comparably sized WaveRNN model.
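
The upsampling block described above (a nearest neighbor resize followed by a separable convolution) can be illustrated with a minimal NumPy sketch. This is not the author's implementation: the 1-D (channels, time) layout, the "same" zero padding, the function names, and the absence of activations or normalization are all assumptions made for illustration, and the abstract does not specify kernel sizes or upsampling factors.

```python
import numpy as np


def nearest_neighbor_resize(x, factor):
    """Repeat every time step `factor` times: (C, T) -> (C, T * factor)."""
    return np.repeat(x, factor, axis=1)


def separable_conv1d(x, depthwise, pointwise):
    """Depthwise convolution per channel, then a 1x1 pointwise channel mix.

    x:         (C, T) input features
    depthwise: (C, K) one length-K filter per input channel (hypothetical shape)
    pointwise: (C_out, C) channel-mixing matrix (hypothetical shape)
    """
    kernel = depthwise.shape[1]
    pad_left = (kernel - 1) // 2
    pad_right = kernel - 1 - pad_left
    # Zero-pad the time axis so the output keeps the input length.
    padded = np.pad(x, ((0, 0), (pad_left, pad_right)))
    # windows has shape (C, T, K): one length-K window per output step.
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel, axis=1)
    # Cross-correlation, the usual deep learning "convolution" (no kernel flip).
    filtered = np.einsum("ctk,ck->ct", windows, depthwise)
    return pointwise @ filtered


def upsample_block(x, factor, depthwise, pointwise):
    """Assumed block layout: resize first, then smooth with a separable conv."""
    resized = nearest_neighbor_resize(x, factor)
    return separable_conv1d(resized, depthwise, pointwise)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mel = rng.standard_normal((80, 10))      # 80 Mel bins, 10 frames
    dw = rng.standard_normal((80, 3))        # kernel size 3 is an assumption
    pw = rng.standard_normal((64, 80))       # 64 output channels is an assumption
    print(upsample_block(mel, 4, dw, pw).shape)
```

Because every output sample receives the same number of input contributions, a resize followed by a convolution avoids the uneven kernel overlap that tends to produce checkerboard patterns in strided transposed convolutions. Splitting the convolution into depthwise and pointwise steps reduces the multiply count relative to a full convolution of the same kernel size.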