Building Keyword Search System from End-to-End ASR Systems

Ruizhe Huang (Johns Hopkins University); Matthew S Wiesner (Johns Hopkins University); Paola Garcia (Johns Hopkins University); Daniel Povey (Johns Hopkins University); Jan Trmal (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University)

SPS

09 Jun 2023

Keyword search (KWS) systems are commonly built on top of existing automatic speech recognition (ASR) systems. However, end-to-end (E2E) ASR models are not naturally equipped with word-level timing information or confidence scores. Existing methods for repurposing E2E ASR systems for KWS are largely heuristic or model-specific. In this paper, we describe a general KWS pipeline applicable to any ASR model that generates N-best lists. We extract timing information using either external word aligners or time-preserving weighted finite-state transducer (WFST) based decoders. We show that our lightweight, ASR-agnostic approach to confidence estimation based on N-best lists outperforms other commonly used heuristics, such as using the decoder’s softmax probability, and even a more complicated dedicated confidence estimation model (CEM). Finally, we compare our performance to that of hybrid ASR models, extensively evaluating the impact of word-level timing, confidence, and recall on KWS performance. Our KWS pipeline is available online and is suitable for evaluating the aforementioned ASR components as a downstream task.
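
To make the idea of N-best-based confidence concrete, the following is a minimal Python sketch of one common formulation: normalize the hypothesis scores of an N-best list into posteriors, then take the posterior mass of the hypotheses that contain the keyword. This is an illustrative assumption, not necessarily the estimator used in the paper; the function name `nbest_keyword_confidence` and the `(words, log_score)` input format are hypothetical, and timing information is ignored entirely.

```python
import math


def nbest_keyword_confidence(nbest, keyword):
    """Estimate a keyword confidence from an N-best list (illustrative only).

    nbest:   list of (words, log_score) pairs, where words is a list of
             tokens and log_score is the hypothesis log-score (assumed format).
    keyword: a single-word query.

    Returns the normalized posterior mass of hypotheses containing the keyword.
    """
    if not nbest:
        return 0.0

    # Subtract the maximum log-score before exponentiating, for numerical stability.
    max_score = max(score for _, score in nbest)
    weights = [math.exp(score - max_score) for _, score in nbest]
    total = sum(weights)

    # Accumulate the weight of every hypothesis in which the keyword occurs.
    hit = sum(w for (words, _), w in zip(nbest, weights) if keyword in words)
    return hit / total


# Toy 3-best list with made-up scores (not data from the paper).
example_nbest = [
    (["open", "the", "door"], math.log(0.5)),
    (["open", "a", "door"], math.log(0.3)),
    (["hope", "the", "door"], math.log(0.2)),
]

if __name__ == "__main__":
    for query in ("open", "the", "window"):
        print(query, round(nbest_keyword_confidence(example_nbest, query), 3))
```

In this toy list, "open" appears in the two highest-scoring hypotheses and receives a confidence of approximately 0.8, while a word absent from every hypothesis receives 0. A sketch like this depends only on the N-best output, which is what makes the approach independent of the ASR model's architecture.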

Tags:

Audio for multimedia and audio processing systems