Convolution-Based Attention Model With Positional Encoding For Streaming Speech Recognition On Embedded Devices

Jinhwan Park, Chanwoo Kim, Wonyong Sung

Length: 0:11:09

19 Jan 2021

On-device automatic speech recognition (ASR) is strongly preferred over server-based implementations owing to its low latency and privacy protection. Many server-based ASR systems employ recurrent neural networks (RNNs) to exploit their ability to recognize long sequences with an extremely small number of states; however, RNNs are inefficient for single-stream implementations on embedded devices. In this study, we develop a highly efficient convolutional ASR model with monotonic chunkwise attention. Although temporal convolution-based models allow more efficient implementations, they require a long filter length to avoid looping or skipping problems. To remedy this, we add positional encoding to a convolution-based ASR encoder while shortening the filter length. We demonstrate that this significantly improves the accuracy of the convolutional model with a short filter length. In addition, we analyze the effect of positional encoding by visualizing the attention energy and encoder outputs. The proposed model achieves a word error rate of 11.20% on TED-LIUMv2 for an end-to-end speech recognition task.
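As a rough illustration of adding positional encoding to a short-filter convolutional encoder, the following is a minimal sketch in plain Python. The abstract does not say which positional encoding scheme or convolution layout is used, so the sinusoidal encoding, the per-channel causal convolution, and all function names, sizes, and kernel values here are assumptions for illustration only, not the authors' implementation.

```python
import math

def sinusoidal_positional_encoding(num_frames, dim):
    """Return a num_frames x dim table of sinusoidal position codes (assumed scheme)."""
    table = []
    for pos in range(num_frames):
        row = []
        for i in range(dim):
            # Channel pairs (2k, 2k+1) share one frequency; even -> sin, odd -> cos.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positional_encoding(features):
    """Add position codes element-wise to a list of per-frame feature vectors."""
    pe = sinusoidal_positional_encoding(len(features), len(features[0]))
    return [[x + p for x, p in zip(frame, code)]
            for frame, code in zip(features, pe)]

def causal_conv1d(features, kernel):
    """Per-channel causal temporal convolution with a short filter.

    output[t] depends only on frames t-K+1 .. t, which suits streaming.
    """
    dim = len(features[0])
    out = []
    for t in range(len(features)):
        row = []
        for c in range(dim):
            acc = 0.0
            for j, weight in enumerate(kernel):
                if t - j >= 0:  # frames before the start count as zero
                    acc += weight * features[t - j][c]
            row.append(acc)
        out.append(row)
    return out

# Hypothetical input: six identical frames with four channels each.
frames = [[0.0] * 4 for _ in range(6)]
kernel = [0.5, 0.3, 0.2]  # short filter of length 3 (illustrative values)

plain = causal_conv1d(frames, kernel)
encoded = causal_conv1d(add_positional_encoding(frames), kernel)
```

With identical input frames, the short filter alone produces identical outputs at every step, whereas the position-encoded input yields outputs that differ from frame to frame. Whether this is the mechanism behind the reported reduction in looping and skipping is one reading of the abstract, not something it states.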

Tags: sps conference, slt 2021

Value-Added Bundle(s) Including this Product

SLT 2021 Virtual Conference - Presentation Videos Product Bundle
