MODELING TURN-TAKING IN HUMAN-TO-HUMAN SPOKEN DIALOGUE DATASETS USING SELF-SUPERVISED FEATURES
Edmilson da Silva Morais (IBM Research Brazil); Matheus Damasceno (IBM Research); Hagai Aronowitz (IBM Research - AI); Aharon Satt (IBM Research); Ron Hoory (IBM Research)
Self-supervised pre-trained models have consistently delivered state-of-the-art results in natural language and speech processing. However, we argue that their merits for modeling Turn-Taking in spoken dialogue systems still require further investigation. To that end, this paper introduces a modular End-to-End system based on an Upstream + Downstream architecture paradigm, which allows easy integration of a wide variety of self-supervised features to model the specific Turn-Taking task of End-of-Turn Detection (EOTD). Several architectures for modeling EOTD using audio-only, text-only, and audio+text modalities are presented, and their performance and robustness are carefully evaluated on three different human-to-human spoken dialogue datasets. The proposed model not only achieves SOTA results for EOTD, but also highlights the potential of powerful, well fine-tuned self-supervised models to be successfully applied to a wide variety of Turn-Taking tasks.
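The Upstream + Downstream paradigm named in the abstract can be sketched minimally as follows. This is an illustrative assumption, not the authors' implementation: a random projection stands in for the pre-trained self-supervised upstream (e.g. a wav2vec 2.0- or HuBERT-style encoder), and a simple logistic head over mean-pooled features plays the role of the downstream EOTD classifier. All names and dimensions are hypothetical.

```python
import numpy as np

FEAT_DIM = 16  # hypothetical upstream feature dimension
rng = np.random.default_rng(0)


def upstream_features(audio_frames: np.ndarray) -> np.ndarray:
    """Map raw audio frames (n_frames, frame_len) to feature vectors
    (n_frames, FEAT_DIM). Placeholder for a pre-trained self-supervised
    encoder such as wav2vec 2.0 or HuBERT (assumption, not the paper's model)."""
    frame_len = audio_frames.shape[1]
    proj = rng.standard_normal((frame_len, FEAT_DIM)) / np.sqrt(frame_len)
    return np.tanh(audio_frames @ proj)


def downstream_eotd(features: np.ndarray, w: np.ndarray, b: float) -> float:
    """Downstream head: mean-pool frame features over time, then score
    end-of-turn probability with a logistic classifier."""
    pooled = features.mean(axis=0)           # temporal mean pooling
    logit = float(pooled @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid -> P(end of turn)


# Toy usage: 10 audio frames of 160 samples each (10 ms at 16 kHz)
audio = rng.standard_normal((10, 160))
feats = upstream_features(audio)             # shape (10, FEAT_DIM)
w = rng.standard_normal(FEAT_DIM)
p_eot = downstream_eotd(feats, w, b=0.0)     # a probability in [0, 1]
```

The modularity claimed in the abstract comes from the clean interface between the two stages: any upstream producing `(n_frames, FEAT_DIM)` features can be swapped in without touching the downstream EOTD head.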