PRACTICE OF THE CONFORMER ENHANCED AUDIO-VISUAL HUBERT ON MANDARIN AND ENGLISH

Xiaoming Ren (OPPO); Chao Li (OPPO); Shenjian Wang (OPPO); Li Biao (OPPO)

09 Jun 2023

Considering the bimodal nature of human speech perception, lip and teeth movement plays a pivotal role in automatic speech recognition. Benefiting from correlated and noise-invariant visual information, audio-visual recognition systems are more robust in a variety of scenarios. In previous work, audio-visual HuBERT (AV-HuBERT) appears to be the best practice for incorporating modality knowledge. This paper sets out to provide a mixed methodology, named Conformer-enhanced AV-HuBERT, that pushes the AV-HuBERT system's performance a step further. Compared with the baseline AV-HuBERT, our method achieves relative WER reductions of 7% and 16% in one-phase evaluation under clean and noisy conditions on the English AVSR benchmark dataset LRS3. Furthermore, we establish a novel 1000-hour Mandarin AVSR dataset, CSTS. By pre-training with this dataset, the baseline AV-HuBERT exceeds the WeNet ASR system by 14% and 18% relatively on MISP and CMLR. The proposed Conformer-enhanced AV-HuBERT brings a further 7% CER reduction on MISP and 6% on CMLR compared with the baseline AV-HuBERT system.
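For readers unfamiliar with the Conformer component referenced above, the following is a minimal sketch of a single Conformer block in PyTorch. It is only an illustration of the standard Conformer structure (Macaron-style feed-forward halves, self-attention, and a depthwise convolution module); the dimensions, kernel size, and the way such blocks would replace the Transformer layers inside AV-HuBERT are assumptions, not the authors' released code.

    import torch
    import torch.nn as nn

    class ConformerBlock(nn.Module):
        def __init__(self, dim=768, heads=12, ff_mult=4, conv_kernel=31, dropout=0.1):
            super().__init__()
            # Two half-step feed-forward modules (Macaron style).
            self.ff1 = nn.Sequential(
                nn.LayerNorm(dim), nn.Linear(dim, dim * ff_mult), nn.SiLU(),
                nn.Dropout(dropout), nn.Linear(dim * ff_mult, dim), nn.Dropout(dropout))
            self.ff2 = nn.Sequential(
                nn.LayerNorm(dim), nn.Linear(dim, dim * ff_mult), nn.SiLU(),
                nn.Dropout(dropout), nn.Linear(dim * ff_mult, dim), nn.Dropout(dropout))
            # Multi-head self-attention module.
            self.attn_norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
            # Convolution module: pointwise -> GLU -> depthwise -> norm -> swish -> pointwise.
            self.conv_norm = nn.LayerNorm(dim)
            self.conv = nn.Sequential(
                nn.Conv1d(dim, 2 * dim, kernel_size=1), nn.GLU(dim=1),
                nn.Conv1d(dim, dim, kernel_size=conv_kernel,
                          padding=conv_kernel // 2, groups=dim),
                nn.BatchNorm1d(dim), nn.SiLU(),
                nn.Conv1d(dim, dim, kernel_size=1), nn.Dropout(dropout))
            self.final_norm = nn.LayerNorm(dim)

        def forward(self, x):                          # x: (batch, time, dim)
            x = x + 0.5 * self.ff1(x)                  # first half-step FFN
            a = self.attn_norm(x)
            x = x + self.attn(a, a, a, need_weights=False)[0]  # self-attention
            c = self.conv_norm(x).transpose(1, 2)      # (batch, dim, time) for Conv1d
            x = x + self.conv(c).transpose(1, 2)       # convolution module
            x = x + 0.5 * self.ff2(x)                  # second half-step FFN
            return self.final_norm(x)

In this sketch, stacking such blocks over the fused audio-visual features would play the role that Transformer encoder layers play in the baseline AV-HuBERT; the convolution module is what lets the Conformer capture local temporal patterns in addition to global attention.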
