FULLY UNSUPERVISED TOPIC CLUSTERING OF UNLABELLED SPOKEN AUDIO USING SELF-SUPERVISED REPRESENTATION LEARNING AND TOPIC MODEL

Takashi Maekaku (Yahoo Japan Corporation); Yuya Fujita (Yahoo Japan Corporation); Xuankai Chang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University)

08 Jun 2023

Unsupervised topic clustering of spoken audio is an important research problem for zero-resource, unwritten languages. A classical approach is to discover a set of spoken terms from the audio alone, using dynamic time warping or generative modeling (e.g., hidden Markov models), and then apply a topic model to classify topics. Spoken term discovery is the most important and most difficult part. In this paper, we propose combining self-supervised representation learning (SSRL) methods, used as a component of spoken term discovery, with probabilistic topic models. Most SSRL methods pre-train a model that predicts high-quality pseudo labels generated from an audio-only corpus. These pseudo labels can be converted into a sequence of pseudo subwords by applying deduplication and a subword model. We then apply a topic model based on latent Dirichlet allocation to these pseudo-subword sequences in an unsupervised manner. Clustering performance is evaluated on the Fisher corpus using normalized mutual information. We confirm that the proposed method is effective and improves on an existing approach that uses dynamic time warping and topic models, although the experimental setups are not directly comparable.
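Two steps named in the abstract, deduplication of frame-level pseudo labels and evaluation with normalized mutual information, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the one-character-per-unit mapping, and the arithmetic-mean normalization of NMI are assumptions made for illustration; the abstract does not specify them. The subword model and the latent Dirichlet allocation step are omitted.

```python
from collections import Counter
from itertools import groupby
from math import log


def deduplicate(frame_labels):
    """Collapse runs of identical consecutive frame-level pseudo labels."""
    return [label for label, _ in groupby(frame_labels)]


def to_pseudo_text(unit_ids, offset=0x4E00):
    """Map each unit ID to one character so a text subword model could read it.

    The offset (start of the CJK block) is an arbitrary illustrative choice.
    """
    return "".join(chr(offset + u) for u in unit_ids)


def entropy(labels):
    """Shannon entropy (in nats) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_topics, predicted_clusters):
    """NMI = I(T; C) / mean(H(T), H(C)).

    Arithmetic-mean normalization is an assumption; other variants exist.
    """
    n = len(true_topics)
    joint = Counter(zip(true_topics, predicted_clusters))
    t_counts = Counter(true_topics)
    c_counts = Counter(predicted_clusters)
    mi = sum(
        (n_tc / n) * log(n * n_tc / (t_counts[t] * c_counts[c]))
        for (t, c), n_tc in joint.items()
    )
    denom = (entropy(true_topics) + entropy(predicted_clusters)) / 2
    # Both labelings constant: treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0


# Hypothetical frame-level pseudo labels for one utterance.
units = deduplicate([5, 5, 5, 12, 12, 7, 7, 7, 7, 5])  # -> [5, 12, 7, 5]
pseudo_text = to_pseudo_text(units)
```

In a full pipeline, strings such as `pseudo_text` would be passed to a subword model, the resulting pseudo-subword sequences to a topic model, and the inferred document clusters compared against reference topic labels with `normalized_mutual_information`.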

Tags:

Spoken document retrieval and written text mining

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
