
MODELING TURN-TAKING IN HUMAN-TO-HUMAN SPOKEN DIALOGUE DATASETS USING SELF-SUPERVISED FEATURES

Edmilson da Silva Morais (IBM Research Brazil); Matheus Damasceno (IBM Research); Hagai Aronowitz (IBM Research - AI); Aharon Satt (IBM Research); Ron Hoory (IBM Research)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
07 Jun 2023

Self-supervised pre-trained models have consistently delivered state-of-the-art results in natural language and speech processing. However, we argue that their merits for modeling Turn-Taking in spoken dialogue systems still need further investigation. To that end, in this paper we introduce a modular End-to-End system based on an Upstream + Downstream architecture paradigm, which allows easy integration of a large variety of self-supervised features to model the specific Turn-Taking task of End-of-Turn Detection (EOTD). We present several architectures that model the EOTD task using audio-only, text-only, and audio+text modalities, and carefully evaluate their performance and robustness on three different human-to-human spoken dialogue datasets. The proposed model not only achieves state-of-the-art (SOTA) results for EOTD, but also sheds light on the possibility that powerful, well fine-tuned self-supervised models can be successfully used for a wide variety of Turn-Taking tasks.
