Leveraging Large Text Corpora for End-to-End Speech Summarization

Kohei Matsuura (NTT); Takanori Ashihara (NTT Corp.); Takafumi Moriya (NTT); Tomohiro Tanaka (NTT); Marc Delcroix (NTT); Atsunori Ogawa (NTT Corporation); Ryo Masumura (NTT Corporation)

07 Jun 2023

End-to-end speech summarization (E2E SSum) is a technique for generating summary sentences directly from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, because collecting a large amount of paired data (i.e., speech and summary) is difficult, the training data is usually insufficient to train a robust E2E SSum system. In this paper, we present two novel methods that leverage a large amount of external text summarization data for E2E SSum training. The first uses a text-to-speech (TTS) system to generate synthesized speech, which is paired with the text summary for E2E SSum training. The second is a TTS-free method that feeds a phoneme sequence, instead of synthesized speech, directly into the E2E SSum model. Experiments show that our proposed TTS- and phoneme-based methods improve several metrics on the How2 dataset. In particular, our best system outperforms a previous state-of-the-art system by a large margin (i.e., METEOR score improvements of more than 6 points). To the best of our knowledge, this is the first work to use external language resources for E2E SSum. Moreover, we report a detailed analysis of the How2 dataset to confirm the validity of our proposed E2E SSum system.
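
As a rough illustration of the TTS-free idea described in the abstract, the sketch below turns text-only summarization pairs into (phoneme sequence, summary) training examples. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy lexicon, the `TrainingExample` container, and the function names are hypothetical, and the abstract does not say which grapheme-to-phoneme converter or model input format the paper uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Toy pronunciation lexicon (hypothetical). A real system would use a full
# grapheme-to-phoneme converter, which the abstract does not specify.
TOY_LEXICON: Dict[str, List[str]] = {
    "how": ["HH", "AW"],
    "to": ["T", "UW"],
    "tie": ["T", "AY"],
    "a": ["AH"],
    "knot": ["N", "AA", "T"],
}


@dataclass
class TrainingExample:
    inputs: List[str]  # phoneme symbols standing in for speech input
    summary: str       # target text summary
    source: str        # marks examples built from text-only data


def text_to_phonemes(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Map whitespace-separated words to phonemes; unknown words become <unk>."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(lexicon.get(word, ["<unk>"]))
    return phonemes


def build_phoneme_examples(
    pairs: List[Tuple[str, str]], lexicon: Dict[str, List[str]]
) -> List[TrainingExample]:
    """Convert (document text, summary) pairs into phoneme-input examples."""
    return [
        TrainingExample(text_to_phonemes(doc, lexicon), summary, "phoneme")
        for doc, summary in pairs
    ]


examples = build_phoneme_examples(
    [("How to tie a knot", "Knot tying basics")], TOY_LEXICON
)
```

In a setup like this, such examples would presumably be mixed with the real speech and summary pairs during training; how the model consumes phoneme inputs alongside speech is a detail of the paper that the abstract does not describe.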

Tags:

Spoken document retrieval and written text mining

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
