Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Pingchuan Ma (Meta); Alexandros Haliassos (Imperial College London); Adriana Fernandez-Lopez (Meta); Honglie Chen (Meta); Stavros Petridis (Imperial College London); Maja Pantic (Facebook / Imperial College London)
Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. We then train ASR, VSR, and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions. The proposed model achieves new state-of-the-art performance for AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods that have been trained on non-publicly available datasets with 26 times more training data.
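
The sketch below illustrates the auto-labelling step described in the abstract: running a publicly available pre-trained ASR model over unlabelled audio clips and storing the hypotheses as pseudo-transcriptions that can later be pooled with the human-labelled LRS2/LRS3 data. It is a minimal sketch under stated assumptions; the checkpoint (facebook/wav2vec2-base-960h), the greedy CTC decoding, the unlabelled_clips/ directory layout, and the transcribe helper are illustrative choices, not the authors' exact pipeline or toolkit.

```python
# Minimal sketch of the auto-labelling step: transcribe unlabelled audio clips
# with a publicly available pre-trained ASR model and keep the hypotheses as
# pseudo-labels. Checkpoint, decoding strategy, and file layout are assumptions.
from pathlib import Path

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to(DEVICE).eval()


def transcribe(wav_path: Path) -> str:
    """Greedy CTC decoding of a single clip with the pre-trained ASR model."""
    waveform, sample_rate = torchaudio.load(str(wav_path))
    if sample_rate != 16_000:  # the checkpoint expects 16 kHz input
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    audio = waveform[0].numpy()  # first channel only
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE)).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0].lower()


if __name__ == "__main__":
    # Hypothetical directory of unlabelled clips (e.g. VoxCeleb2 or AVSpeech utterances).
    for wav in sorted(Path("unlabelled_clips").glob("*.wav")):
        text = transcribe(wav)
        # Store the hypothesis next to the clip so it can be merged with the
        # labelled LRS2/LRS3 transcriptions to form the augmented training set.
        wav.with_suffix(".txt").write_text(text + "\n")
```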