PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement

Muqiao Yang (Carnegie Mellon University); Joseph Konan (Carnegie Mellon University); David Bick (Carnegie Mellon University); Yunyang Zeng (Carnegie Mellon University); Shuo Han (Carnegie Mellon University); Anurag Kumar (Facebook Research); Shinji Watanabe (Carnegie Mellon University); Bhiksha Raj (Carnegie Mellon University)

06 Jun 2023

Despite rapid advances in recent years, current speech enhancement models often produce speech that differs in perceptual quality from real clean speech. We propose a learning objective that formalizes differences in perceptual quality by using domain knowledge from acoustic-phonetics. We identify temporal acoustic parameters, such as spectral tilt, spectral flux, and shimmer, that are non-differentiable, and we develop a neural network estimator that can accurately predict their time-series values across an utterance. We also model phoneme-specific weights for each feature, since the acoustic parameters are known to behave differently across phonemes. This criterion can be added as an auxiliary loss to any model that produces speech, optimizing its outputs to match the values of clean speech in these features. Experimentally, we show that it improves speech enhancement workflows in both the time domain and the time-frequency domain, as measured by standard evaluation metrics. We also provide an analysis of phoneme-dependent improvement on acoustic parameters, demonstrating the additional interpretability that our method provides. This analysis can suggest which features are currently the bottleneck for improvement.
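
To make the idea of a phoneme-weighted auxiliary loss concrete, here is a minimal sketch in plain Python. It is not the authors' formulation: the function names, the use of a squared error, the per-frame phoneme labels, the weight table, and the mixing coefficient lam are all assumptions chosen for illustration. It only shows the arithmetic; in training, the acoustic parameters would come from the differentiable neural estimator and the computation would run in an autodiff framework.

```python
from typing import Sequence


def phoneme_weighted_param_loss(
    est_params: Sequence[Sequence[float]],    # [T][D] parameters of enhanced speech
    clean_params: Sequence[Sequence[float]],  # [T][D] parameters of clean speech
    phoneme_ids: Sequence[int],               # [T] phoneme label per frame (assumed given)
    weights: Sequence[Sequence[float]],       # [P][D] weight per (phoneme, parameter)
) -> float:
    """Mean weighted squared error over all frames and parameters (hypothetical form)."""
    if not (len(est_params) == len(clean_params) == len(phoneme_ids)):
        raise ValueError("all inputs must cover the same number of frames")
    total = 0.0
    count = 0
    for est, clean, phoneme in zip(est_params, clean_params, phoneme_ids):
        # Pick the weight row for the phoneme active in this frame.
        for e, c, w in zip(est, clean, weights[phoneme]):
            total += w * (e - c) ** 2
            count += 1
    return total / count if count else 0.0


def combined_loss(base_loss: float, aux_loss: float, lam: float = 0.5) -> float:
    """Add the auxiliary term to an existing enhancement loss (lam is an assumption)."""
    return base_loss + lam * aux_loss


# Toy example: 2 frames, 2 acoustic parameters, 2 phoneme classes.
est = [[1.0, 2.0], [0.0, 1.0]]
clean = [[1.0, 1.0], [1.0, 1.0]]
frame_phonemes = [0, 1]
phoneme_weights = [[1.0, 0.5], [2.0, 0.0]]

aux = phoneme_weighted_param_loss(est, clean, frame_phonemes, phoneme_weights)
print(aux, combined_loss(1.0, aux))
```

Because the weights are indexed by phoneme and by parameter, the same per-term errors can also be grouped by phoneme to see which acoustic parameters contribute most to the remaining mismatch, which is the kind of analysis the abstract describes.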

Tags: Speech enhancement and separation

Included in: IEEE ICASSP 2023, 4-10 June 2023, Greece (virtual and in-person conference), presentation videos product bundle
