Hang-Yip Liu, Steffen Schulze, et al.
Proceedings of SPIE - The International Society for Optical Engineering
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
Hang-Yip Liu, Steffen Schulze, et al.
Proceedings of SPIE - The International Society for Optical Engineering
Imran Nasim, Michael E. Henderson
Mathematics
I.K. Pour, D.J. Krajnovich, et al.
SPIE Optical Materials for High Average Power Lasers 1992
Kenneth L. Clarkson, K. Georg Hampel, et al.
VTC Spring 2007