Speech summarization of long spoken document: Improving memory efficiency of speech/text encoders

Takatomo Kano (NTT Corporation); Atsunori Ogawa (NTT Corporation); Marc Delcroix (NTT); Roshan S Sharma (Carnegie Mellon University); Kohei Matsuura (NTT); Shinji Watanabe (Carnegie Mellon University)

06 Jun 2023

Speech summarization requires processing speech sequences that are several minutes long in order to exploit the whole context of a spoken document. A conventional approach is a cascade of automatic speech recognition (ASR) and text summarization (TS). However, cascade systems are sensitive to ASR errors; moreover, they cannot be optimized for the input speech or exploit para-linguistic information. Recently, there has been increased interest in end-to-end (E2E) approaches that are optimized to output summaries directly from speech and can thus mitigate the ASR errors of cascade approaches. However, E2E speech summarization requires massive computational resources because it must encode long speech sequences. We propose a speech summarization system that extends E2E summarization from about 100 seconds of speech, the limit of the conventional method, to up to 10 minutes (i.e., the duration of typical instructional videos on YouTube). However, the modeling capability of this model for minute-long speech sequences is weaker than that of the conventional approach. We therefore exploit auxiliary text information from ASR transcriptions to improve its modeling capability. The resulting system consists of a dual speech/text encoder, decoder-based summarization system. Experiments on the How2 dataset show that the proposed system improves METEOR scores by up to 2.7 points by fully exploiting long spoken documents.
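
The abstract describes a dual speech/text encoder feeding a single summary decoder. The sketch below is an illustrative PyTorch toy model of that general idea only, not the authors' implementation; all layer sizes, feature dimensions, the class name, and the simple concatenation-based fusion are assumptions made for demonstration.

```python
# Minimal sketch (assumptions, not the paper's actual model): a speech encoder
# and a text encoder whose outputs are fused and attended over by one decoder.
import torch
import torch.nn as nn


class DualEncoderSummarizer(nn.Module):
    def __init__(self, vocab_size=5000, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        # Speech branch: project acoustic features (assumed 80-dim filterbanks)
        # into the model dimension, then encode with a Transformer encoder.
        self.speech_proj = nn.Linear(80, d_model)
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Text branch: embed ASR-transcript tokens and encode them separately.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # A single decoder attends over the combined speech/text memories.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, asr_tokens, summary_tokens):
        speech_mem = self.speech_encoder(self.speech_proj(speech_feats))
        text_mem = self.text_encoder(self.text_embed(asr_tokens))
        # Fuse the two encoders by concatenating their outputs along time
        # (an assumed fusion strategy chosen for simplicity).
        memory = torch.cat([speech_mem, text_mem], dim=1)
        tgt = self.text_embed(summary_tokens)
        return self.out_proj(self.decoder(tgt, memory))


# Toy usage: 1 utterance, 200 speech frames, 50 ASR tokens, 20 summary tokens.
model = DualEncoderSummarizer()
logits = model(torch.randn(1, 200, 80),
               torch.randint(0, 5000, (1, 50)),
               torch.randint(0, 5000, (1, 20)))
print(logits.shape)  # torch.Size([1, 20, 5000])
```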
