End-to-end multi-modal speech recognition with air and bone conducted speech
Junqi Chen, Mou Wang, Xiao-Lei Zhang, Zhiyong Huang, Susanto Rahardja
Improving the performance of automatic speech recognition (ASR) in adverse acoustic environments is a long-standing challenge. Although many robust ASR systems based on conventional microphones have been developed, their performance on air-conducted (AC) speech remains far from satisfactory in low signal-to-noise-ratio (SNR) environments. Bone-conducted (BC) speech is relatively insensitive to ambient noise and therefore has the potential, as an auxiliary source, to improve ASR performance in such low-SNR environments. In this paper, we propose a conformer-based multi-modal speech recognition system. It uses a conformer encoder and a transformer-based truncated decoder to extract semantic information from the AC and BC channels, respectively. The semantic information of the two channels is re-weighted and integrated by a novel multi-modal transducer. Experimental results show the effectiveness of the proposed method. For example, in a 0 dB SNR environment, it yields a character error rate over 59.0% lower than that of a noise-robust baseline using only the AC channel, and over 12.7% lower than that of a multi-modal baseline that takes the concatenated features of AC and BC speech as input.
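The abstract does not spell out how the multi-modal transducer re-weights and integrates the two channels. As a rough illustration only, the sketch below implements one plausible reading: a learned per-frame softmax gate over the AC and BC embeddings followed by a weighted sum. The module name, gating mechanism, and dimensions are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class GatedChannelFusion(nn.Module):
    """Hypothetical sketch: per-frame re-weighting and integration of
    AC and BC channel embeddings. The paper's multi-modal transducer
    may use a different fusion mechanism."""

    def __init__(self, dim: int):
        super().__init__()
        # Learned gate producing one score per channel for each frame.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, ac: torch.Tensor, bc: torch.Tensor) -> torch.Tensor:
        # ac, bc: (batch, time, dim) semantic embeddings from the
        # AC-channel and BC-channel encoder/decoder branches.
        weights = torch.softmax(self.gate(torch.cat([ac, bc], dim=-1)), dim=-1)
        # Re-weight each channel and sum into one fused representation,
        # letting the model lean on BC frames when the AC channel is noisy.
        return weights[..., 0:1] * ac + weights[..., 1:2] * bc


# Usage: fuse 100-frame, 256-dimensional embeddings from both channels.
fusion = GatedChannelFusion(256)
ac_emb = torch.randn(4, 100, 256)  # air-conducted channel embeddings
bc_emb = torch.randn(4, 100, 256)  # bone-conducted channel embeddings
fused = fusion(ac_emb, bc_emb)     # -> (4, 100, 256)
```

A frame-level gate like this is one common way to realize "re-weighting" in multi-channel fusion; it keeps the fused sequence the same length and dimension as each input branch, which matches the description of feeding a single integrated representation onward.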