Skip to main content

FULLY UNSUPERVISED TOPIC CLUSTERING OF UNLABELLED SPOKEN AUDIO USING SELF-SUPERVISED REPRESENTATION LEARNING AND TOPIC MODEL

Takashi Maekaku (Yahoo Japan Corporation); Yuya Fujita (Yahoo Japan Corporation); Xuankai Chang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
08 Jun 2023

Unsupervised topic clustering of spoken audio is an important research topic for zero-resourced unwritten languages. A classical approach is to find a set of spoken terms from only the audio based on dynamic time warping or generative modeling (e.g., hidden Markov model), and apply a topic model to classify topics. The spoken term discovery is the most important and difficult part. In this paper, we propose to combine self-supervised representation learning (SSRL) methods as a component of spoken term discovery and probabilistic topic models. Most SSRL methods pre-train a model which predicts high-quality pseudo labels generated from an audio-only corpus. These pseudo labels can be used to produce a sequence of pseudo subwords by applying deduplication and a subword model. Then, we apply a topic model based on latent Dirichlet allocation for these pseudo-subword sequences in an unsupervised manner. The clustering performance is evaluated on the Fischer corpus using normalized mutual information. We confirm the improvement of the proposed method and its effectiveness compared to an existing approach using dynamic time warping and topic models although the experimental setups are not directly comparable.

More Like This

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00