PRACTICE OF THE CONFORMER ENHANCED AUDIO-VISUAL HUBERT ON MANDARIN AND ENGLISH
Xiaoming Ren (OPPO); Chao Li (OPPO); Shenjian Wang (OPPO); Li Biao (OPPO)
Considering the bimodal nature of human speech perception, the movement of the lips and teeth plays a pivotal role in automatic speech recognition. Benefiting from correlated and noise-invariant visual information, audio-visual recognition systems are more robust across a variety of scenarios. Among previous work, audio-visual HuBERT (AV-HuBERT) appears to be the best practice for incorporating cross-modal knowledge. This paper sets out to provide a mixed methodology, named Conformer-enhanced AV-HuBERT, that pushes the AV-HuBERT system's performance a step further. Compared with the baseline AV-HuBERT, our method achieves relative WER reductions of 7% and 16% under clean and noisy conditions, respectively, in one-phase evaluation on the English AVSR benchmark dataset LRS3. Furthermore, we establish a novel 1000-hour Mandarin AVSR dataset, CSTS. By pre-training with this dataset, the baseline AV-HuBERT exceeds the WeNet ASR system by 14% and 18% relative on MISP and CMLR, respectively. The proposed Conformer-enhanced AV-HuBERT brings a further 7% CER reduction on MISP and 6% on CMLR, compared with the baseline AV-HuBERT system.
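The abstract does not spell out the exact Conformer configuration used to enhance the AV-HuBERT encoder. Below is a minimal PyTorch sketch of a standard Conformer block (Gulati et al., 2020) of the kind that would replace or augment the Transformer layers over the fused audio-visual features; all dimensions, names, and the omission of relative positional encoding are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConformerBlock(nn.Module):
    """One Conformer block: half-step FFN -> self-attention ->
    convolution module -> half-step FFN -> final LayerNorm."""

    def __init__(self, dim=768, heads=12, ff_mult=4, conv_kernel=31):
        super().__init__()
        self.ff1 = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * ff_mult),
            nn.SiLU(), nn.Linear(dim * ff_mult, dim))
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Convolution module: pointwise conv + GLU gate, then a
        # depthwise conv that models local (e.g. lip-motion) context.
        self.conv_norm = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=conv_kernel,
                                   padding=conv_kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)
        self.ff2 = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * ff_mult),
            nn.SiLU(), nn.Linear(dim * ff_mult, dim))
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, dim) fused audio-visual features
        x = x + 0.5 * self.ff1(x)                     # half-step residual
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)         # -> (batch, dim, time)
        c = F.glu(self.pointwise_in(c), dim=1)        # gated linear unit
        c = self.pointwise_out(F.silu(self.bn(self.depthwise(c))))
        x = x + c.transpose(1, 2)                     # -> (batch, time, dim)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)


# Usage: stack such blocks as the encoder over fused AV features.
feats = torch.randn(2, 100, 768)        # (batch, frames, feature dim)
print(ConformerBlock()(feats).shape)    # torch.Size([2, 100, 768])
```

The convolution module is what distinguishes a Conformer from a plain Transformer layer: the depthwise convolution captures fine-grained local dependencies, which plausibly complements the global self-attention when modeling lip and teeth movement.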