NON-AUTOREGRESSIVE ASR WITH SELF-CONDITIONED FOLDED ENCODERS

Tatsuya Komatsu

Length: 00:08:05

11 May 2022

This paper proposes CTC-based non-autoregressive ASR with self-conditioned folded encoders. The proposed method realizes non-autoregressive ASR with fewer parameters by folding the conventional stack of encoders into only two blocks: base encoders and folded encoders. The base encoders convert the input audio features into a neural representation suitable for recognition. The folded encoders are then applied repeatedly to refine this representation. Applying the CTC loss to the outputs of all encoders enforces consistency of the input-output relationship across iterations, so the folded encoders learn to perform the same operations as an encoder with a deeper stack of distinct layers. In experiments, we investigate how to set the number of layers and the number of iterations for the base and folded encoders. The results show that the proposed method achieves performance comparable to that of the conventional method while using only 38% as many parameters. Furthermore, it outperforms the conventional method when the number of iterations is increased.
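The folding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `ToyLayer`, `folded_forward`, `unique_params`, and `effective_depth` are hypothetical names, each "layer" is a scalar affine map with a tanh rather than a Conformer block, and the layer counts and iteration count are arbitrary illustrative values, not settings from the paper. The CTC loss and the self-conditioning step are omitted; the sketch only marks the outputs where a CTC loss would be attached.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ToyLayer:
    """Hypothetical stand-in for one encoder layer (scalar affine map + tanh)."""
    weight: float
    bias: float

    def __call__(self, frames: List[float]) -> List[float]:
        return [math.tanh(self.weight * x + self.bias) for x in frames]

    def num_params(self) -> int:
        return 2  # weight and bias


def folded_forward(frames, base_layers, folded_layers, num_iterations):
    """Run the base layers once, then reuse the same folded layers repeatedly.

    Returns one output per block application: after the base block and after
    each pass through the folded block. In the scheme described above, a CTC
    loss would be applied to every one of these outputs (not computed here).
    """
    outputs = []
    hidden = frames
    for layer in base_layers:
        hidden = layer(hidden)
    outputs.append(hidden)
    for _ in range(num_iterations):
        for layer in folded_layers:  # same weights on every iteration
            hidden = layer(hidden)
        outputs.append(hidden)
    return outputs


def unique_params(base_layers, folded_layers):
    """Parameters actually stored: each layer is counted once."""
    return sum(layer.num_params() for layer in base_layers + folded_layers)


def effective_depth(base_layers, folded_layers, num_iterations):
    """Number of layer applications an input passes through."""
    return len(base_layers) + len(folded_layers) * num_iterations


# Illustrative configuration (assumed values, not taken from the paper).
base = [ToyLayer(0.9, 0.1), ToyLayer(1.1, -0.2)]
folded = [ToyLayer(0.8, 0.05)]
outs = folded_forward([0.5, -0.3, 0.2], base, folded, num_iterations=4)
depth = effective_depth(base, folded, 4)
stored = unique_params(base, folded)
```

In this toy setting the input passes through six layer applications while only three layers' worth of parameters are stored, which is the mechanism by which folding reduces model size; raising `num_iterations` deepens the computation without adding parameters.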

Tags:

conformer

self-conditioned ctc

intermediate ctc

non-autoregressive asr

ctc

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
