T-9: Distant conversational speech recognition and analysis: Recent advances, and trends towards end-to-end optimization
Keisuke Kinoshita, Yusuke Fujita, Naoyuki Kanda, Shinji Watanabe
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 03:43:32
Recognizing unsegmented conversational speech recorded with distant microphone(s) is a challenging but essential task that must be solved to enable a myriad of new speech applications, such as a communication agent that can understand, respond to, and facilitate our conversations. This task comprises a number of subtasks, such as multichannel/single-channel source separation, speaker diarization with source number counting, and conversational speech recognition, which have been studied rather independently for a decade. This tutorial first revisits, with demonstrations, current state-of-the-art systems for this task, which were developed for challenges such as CHiME-5 and CHiME-6 and for commercial products. These systems typically consist of a combination of well-established, independently optimized modules. Although these systems are carefully designed to consolidate the independent modules, there is still considerable room for improvement. In the latter part of the tutorial, we introduce a recent research trend that aims to establish an optimal joint neural system that solves those subtasks together, through end-to-end optimization based on a common integrated objective. By showing the potential of such jointly optimized systems, which are now starting to outperform previous top-performing systems in many tasks, we discuss future directions and challenges for this task from both industry and academic perspectives.