Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Models

Ramon R Sanabria (The University of Edinburgh); Hao Tang (The University of Edinburgh); Sharon Goldwater (The University of Edinburgh)

06 Jun 2023

Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWEs), fixed-dimensional vectors representing variable-length spoken word segments. In this work, we study several pre-trained models and pooling methods for constructing AWEs from self-supervised representations. Owing to the contextualized nature of self-supervised representations, we hypothesize that simple pooling methods, such as averaging, might already be useful for constructing AWEs. When evaluating on a standard word discrimination task, we find that HuBERT representations with mean-pooling rival the state of the art on English AWEs. More surprisingly, despite being trained only on English, HuBERT representations evaluated on Xitsonga, Mandarin, and French consistently outperform the multilingual model XLSR-53 (as well as Wav2Vec 2.0 trained on English).
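As a rough illustration of the mean-pooling idea described in the abstract, the sketch below averages the frame-level vectors of a word segment into a single fixed-dimensional vector and compares two such vectors. It is not the authors' implementation. The function names `mean_pool` and `cosine_distance`, the use of cosine distance as the comparison measure, and the toy two-dimensional frames (stand-ins for real frame-level features from a model such as HuBERT) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]


def mean_pool(frames: Sequence[Frame]) -> List[float]:
    """Average frame-level vectors over time into one fixed-dimensional vector."""
    if not frames:
        raise ValueError("a word segment needs at least one frame")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must share the same dimensionality")
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed here as the comparison measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


# Hypothetical toy segments of different lengths (frames x feature dimension).
segment_a = [[1.0, 0.0], [3.0, 0.0]]              # 2 frames
segment_b = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]  # 3 frames
segment_c = [[0.0, 1.0], [0.0, 3.0]]              # 2 frames

# Each embedding has the same dimensionality regardless of segment length.
awe_a = mean_pool(segment_a)
awe_b = mean_pool(segment_b)
awe_c = mean_pool(segment_c)

print(cosine_distance(awe_a, awe_b))  # small distance: similar segments
print(cosine_distance(awe_a, awe_c))  # large distance: dissimilar segments
```

In a word discrimination setting, pairs of segments whose embeddings fall within some distance threshold would be treated as the same word; how that threshold is chosen and scored is left out of this sketch.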

Tags:

Word spotting, VAD, and other topics in speech recognition

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
