Next-speaker Prediction Based on Non-Verbal Information in Multi-party Video Conversation

Saki Mizuno (NTT Computer & Data Science Laboratories); Nobukatsu Hojo (NTT); Satoshi Kobashikawa (NTT Corporation); Ryo Masumura (NTT Corporation)

07 Jun 2023

We propose a method for next-speaker prediction, the task of predicting which of the current listeners will speak in the next turn, in multi-party video conversation. Previous studies used non-verbal features, such as head movements and gaze behavior, for next-speaker prediction in face-to-face conversation. In video conversation, however, these features are vague and ineffective because participants look at a screen displaying the other participants. Since non-verbal features reflect participant characteristics, training data with rich combinations of participants are necessary to predict the next speaker robustly. Previous studies used training data with a limited number of participant combinations because the data consist only of recorded conversations. The proposed method therefore uses 1) novel non-verbal features for next-speaker prediction in video conversation, specifically facial expressions, hand movements, and speech segments, and 2) data augmentation of participant combinations in the training data. We conducted experiments to evaluate the proposed method, and the results on video-conversation data indicate its effectiveness.
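The abstract does not detail how participant combinations are augmented. As a hedged illustration only (not the authors' procedure), one common way to enrich participant combinations is to permute the order of participants within each multi-party sample, keeping each participant's features aligned with their next-speaker label; the function name and data layout below are assumptions for the sketch:

```python
import itertools
import random

def augment_participant_combinations(features, labels, max_perms=6, seed=0):
    """Illustrative sketch: create additional training samples by permuting
    the participant order within one multi-party sample.

    features: list of per-participant feature vectors (one entry per participant)
    labels:   list of 0/1 next-speaker labels, aligned with `features`
    Returns a list of (features, labels) pairs, each a participant permutation
    of the input sample with feature-label alignment preserved.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(features)
    perms = list(itertools.permutations(range(n)))
    rng.shuffle(perms)
    augmented = []
    for perm in perms[:max_perms]:
        augmented.append(
            ([features[i] for i in perm], [labels[i] for i in perm])
        )
    return augmented
```

This keeps each participant's non-verbal features tied to their own label while varying the combination order the model sees, which is one plausible way to reduce sensitivity to specific participant arrangements in recorded data.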
