
Context-Aware Fine-Tuning of Self-Supervised Speech Models

Suwon Shon (ASAPP); Felix Wu (ASAPP); Kwangyoun Kim (ASAPP); Prashant Sridhar (ASAPP); Karen Livescu (TTI-Chicago); Shinji Watanabe (Carnegie Mellon University)

07 Jun 2023

Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. We attach a context module on top of the last layer of a pre-trained model to encode the whole segment into a context embedding vector, which is then used as an additional feature for the final prediction. During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments. This allows the model to make predictions without access to these surrounding segments at inference time and requires only a tiny overhead compared to standard fine-tuned models. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: Automatic Speech Recognition (ASR), Named Entity Recognition (NER), and Sentiment Analysis (SA). The results show that context-aware fine-tuning not only outperforms a standard fine-tuning baseline but also rivals a strong context injection baseline that uses neighboring speech segments during inference.
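The sketch below illustrates the idea described in the abstract: a small context module on top of the encoder's last layer produces a segment-level context embedding that is appended as an extra feature for prediction, and an auxiliary loss pulls this embedding toward the context embeddings of neighboring segments during fine-tuning. The specific choices here (mean pooling plus a linear projection as the context module, a cosine-distance auxiliary loss, frame-level classification) are assumptions for illustration, not the authors' exact design.

```python
# Minimal sketch of context-aware fine-tuning; not the paper's implementation.
# Assumptions: the pre-trained encoder yields hidden_states of shape
# (batch, time, hidden_dim); the context module is mean pooling + linear;
# the auxiliary loss is cosine distance to (detached) neighbor embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareHead(nn.Module):
    def __init__(self, hidden_dim: int, context_dim: int, num_labels: int):
        super().__init__()
        # Context module attached on top of the encoder's last layer.
        self.context_proj = nn.Linear(hidden_dim, context_dim)
        # Final predictor sees frame features concatenated with the context vector.
        self.classifier = nn.Linear(hidden_dim + context_dim, num_labels)

    def context_embedding(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, time, hidden_dim) -> (batch, context_dim)
        return self.context_proj(hidden_states.mean(dim=1))

    def forward(self, hidden_states: torch.Tensor):
        ctx = self.context_embedding(hidden_states)                        # (B, C)
        ctx_tiled = ctx.unsqueeze(1).expand(-1, hidden_states.size(1), -1)  # (B, T, C)
        logits = self.classifier(torch.cat([hidden_states, ctx_tiled], dim=-1))
        return logits, ctx


def context_auxiliary_loss(ctx: torch.Tensor,
                           neighbor_hidden: list,
                           head: ContextAwareHead) -> torch.Tensor:
    """Encourage the current segment's context embedding to match those of its
    surrounding segments. Neighbors are only needed during fine-tuning; at
    inference the model predicts from the current segment alone."""
    loss = ctx.new_zeros(())
    for h in neighbor_hidden:
        with torch.no_grad():  # neighbor embeddings act as fixed targets
            target = head.context_embedding(h)
        loss = loss + (1.0 - F.cosine_similarity(ctx, target, dim=-1)).mean()
    return loss / max(len(neighbor_hidden), 1)
```

In this sketch, the total fine-tuning objective would be the usual task loss (e.g., CTC for ASR or cross-entropy for classification) plus a weighted auxiliary term, so that at inference time only the current segment needs to be encoded while the context embedding still reflects its surroundings.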
