Better Together: Dialogue Separation and Voice Activity Detection for Audio Personalization in TV

Matteo Torcoli (International Audio Laboratories Erlangen); Emanuel Habets (AudioLabs Erlangen)

06 Jun 2023

In TV services, dialogue level personalization is key to meeting user preferences and needs. When dialogue and background sounds are not separately available from the production stage, Dialogue Separation (DS) can estimate them to enable personalization. DS has been shown to provide clear benefits for the end user. Still, the estimated signals are not perfect, and some leakage can be introduced. This is undesirable, especially during passages without dialogue. We propose to combine DS and Voice Activity Detection (VAD), both recently proposed for TV audio. When their combination suggests dialogue inactivity, background components that leaked into the dialogue estimate are reassigned to the background estimate. A clear improvement in audio quality is shown for dialogue-free signals, without performance drops when dialogue is active. A post-processed VAD estimate with improved detection accuracy is also generated. It is concluded that DS and VAD can improve each other and are better used together.
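
The reassignment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how DS and VAD are combined, so the rule below (a frame counts as dialogue-free when the VAD probability is low and the dialogue estimate carries only a small share of the mixture energy) is an assumption. The function name `reassign_leakage`, the parameters, and both threshold values are hypothetical.

```python
import numpy as np


def reassign_leakage(dialogue_est, background_est, vad_prob, frame_len,
                     vad_threshold=0.5, energy_ratio_threshold=0.1):
    """Move the dialogue estimate into the background estimate in frames
    judged dialogue-free.

    Assumed inputs (hypothetical, not taken from the paper):
      dialogue_est, background_est: 1-D time-domain DS outputs, same length
      vad_prob: one dialogue-presence probability per frame
      frame_len: frame length in samples
    """
    dialogue = np.asarray(dialogue_est, dtype=float).copy()
    background = np.asarray(background_est, dtype=float).copy()
    vad_prob = np.asarray(vad_prob, dtype=float)

    if dialogue.ndim != 1 or dialogue.shape != background.shape:
        raise ValueError("estimates must be 1-D arrays of equal length")
    n_frames = int(np.ceil(len(dialogue) / frame_len))
    if len(vad_prob) != n_frames:
        raise ValueError("vad_prob must hold one value per frame")

    active = np.ones(n_frames, dtype=bool)
    eps = 1e-12  # avoids division by zero in silent frames
    for i in range(n_frames):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        d_energy = np.sum(dialogue[sl] ** 2)
        mix_energy = np.sum((dialogue[sl] + background[sl]) ** 2)
        ratio = d_energy / (mix_energy + eps)

        # Assumed combination rule: both cues must indicate inactivity.
        if vad_prob[i] < vad_threshold and ratio < energy_ratio_threshold:
            background[sl] += dialogue[sl]  # leaked content goes back
            dialogue[sl] = 0.0              # dialogue estimate is muted
            active[i] = False

    return dialogue, background, active


# Toy example: frame 0 contains only leakage, frame 1 contains dialogue.
dialogue_in = np.array([0.01] * 4 + [1.0] * 4)
background_in = np.ones(8)
d_out, b_out, active_out = reassign_leakage(
    dialogue_in, background_in, vad_prob=[0.1, 0.9], frame_len=4)
```

Because the leaked content is added to the background rather than discarded, the two outputs still sum to the original mixture. The `active` flags returned here are only a by-product of this sketch and should not be read as the post-processed VAD estimate reported in the paper. The sketch also switches abruptly at frame boundaries, whereas a practical system would likely smooth the transition.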

Tags:

Audio for multimedia and audio processing systems

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
