ON WORD ERROR RATE DEFINITIONS AND THEIR EFFICIENT COMPUTATION FOR MULTI-SPEAKER SPEECH RECOGNITION SYSTEMS

Thilo von Neumann (Paderborn University); Christoph B Boeddeker (Paderborn University); Keisuke Kinoshita (Google); Marc Delcroix (NTT); Reinhold Haeb-Umbach (University of Paderborn)

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

06 Jun 2023

We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor under the constraint that a reference utterance must be matched consistently with one hypothesis output. This also results in an efficient implementation of the ORC WER which previously suffered from exponential complexity. We give an overview of commonly used WER definitions for multi-speaker scenarios and show that they are specializations of the above MIMO WER tuned to particular application scenarios. We conclude with a discussion of the pros and cons of the various WER definitions and a recommendation when to use which.

Tags:

Word spotting, VAD, and other topics in speech recognition

ON WORD ERROR RATE DEFINITIONS AND THEIR EFFICIENT COMPUTATION FOR MULTI-SPEAKER SPEECH RECOGNITION SYSTEMS

Thilo von Neumann (Paderborn University); Christoph B Boeddeker (Paderborn University); Keisuke Kinoshita (Google); Marc Delcroix (NTT); Reinhold Haeb-Umbach (University of Paderborn)

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

FEDERATED LEARNING FOR ASR BASED ON WAV2VEC 2.0

The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

Joint unsupervised and supervised learning for context-aware language identification

Join the IEEE Signal Processing Society