Whether Contribution of Features Differ Between Video-mediated and In-person Meetings in Important Utterance Estimation

Fumio Nihei (NTT); Ryo Ishii (NTT); Yukiko Nakano (Seikei University); Atsushi Fukayama (NTT); Takao Nakamura (NTT)

SPS

07 Jun 2023

This study investigated how the contributions of different features to important utterance estimation differ between in-person (IP) and video-mediated (VM) meetings. We used an IP meeting corpus and a VM meeting corpus as the analysis data. A transformer model with dialogue history was used to estimate important utterances, and five types of input (text, speaker's audio, others' audio, speaker's video, and others' video) were fed to the model. A comparison of the IP and VM models revealed that the speaker's audio has a strong effect on the IP model, the video of the other participants strongly affects the VM model, and the text and others' audio strongly affect both models in estimating important utterances.
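As a rough illustration of what "five types of input" with "dialogue history" could look like as model input, the sketch below fuses per-utterance feature vectors and groups each utterance with its preceding context. The abstract does not describe how the inputs are combined or how much history is used, so the concatenation-based fusion, the `history` parameter, the function names, and the toy one-dimensional features are all assumptions made for illustration only, not the authors' method.

```python
from typing import Dict, List

# The five input types named in the abstract (the key names are hypothetical).
MODALITIES = ["text", "speaker_audio", "others_audio", "speaker_video", "others_video"]

Utterance = Dict[str, List[float]]  # modality name -> feature vector


def fuse(utterance: Utterance) -> List[float]:
    """Concatenate the five modality vectors in a fixed order (assumed fusion)."""
    fused: List[float] = []
    for name in MODALITIES:
        fused.extend(utterance[name])
    return fused


def build_history_windows(dialogue: List[Utterance], history: int) -> List[List[List[float]]]:
    """For each utterance, collect up to `history` preceding utterances plus
    the target utterance itself, with the target placed last."""
    fused = [fuse(u) for u in dialogue]
    windows = []
    for i in range(len(fused)):
        start = max(0, i - history)
        windows.append(fused[start:i + 1])
    return windows


def make_toy_utterance(value: float) -> Utterance:
    """Toy utterance with a one-dimensional feature per modality (not real data)."""
    return {name: [float(value)] for name in MODALITIES}


dialogue = [make_toy_utterance(v) for v in (0, 1, 2)]
windows = build_history_windows(dialogue, history=1)
```

Each entry of `windows` is a short sequence that a sequence model such as a transformer encoder could consume, with the final element being the utterance to classify as important or not. Keeping the modalities in a fixed order also makes it straightforward to mask one modality at a time when probing its contribution.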

Tags:

Multi-modal signal processing and analysis (audio/visual/haptics/radar/lidar etc.)