
TalkingHeads: Audiovisual Speech Recognition in-the-wild

Periodic Reporting for period 1 - TalkingHeads (TalkingHeads: Audiovisual Speech Recognition in-the-wild)

Reporting period: 2016-06-01 to 2018-05-31

The project addresses audio-visual (AV) Automatic Speech Recognition (ASR) in unconstrained (in-the-wild) videos collected from real-world multimedia databases (outdoor conversations/interviews, TV shows with multiple speakers), using novel deep learning methodologies and architectures.


There are numerous applications associated with visual speech recognition, ranging from medical applications (aphonia, dysphonia, hearing aids, among others) to AV-ASR for noisy environments, and from audiovisual biometrics to surveillance.


1. To improve the state-of-the-art in AV ASR by using novel deep learning methods.
2. To introduce new applications related to AV ASR.
3. To transfer knowledge between Speech Recognition and Computer Vision.


1. Visual speech recognition is now a mature technology. It is capable of increasing the accuracy of audio-only speech recognition in both clean and noisy environments, even when the speaker's mouth area is not captured in high resolution.
2. Deep learning is currently the dominant machine learning paradigm for addressing tasks related to AV ASR, and novel, large, in-the-wild databases are key to deploying deep learning methods.
3. We believe that the output of TalkingHeads constitutes a significant contribution to the development of AV ASR applications.

-Visual Speech Recognition

We developed an end-to-end deep learning architecture for word-level visual speech recognition, attaining a 17.0% word error rate on the BBC-TV "Lipreading in the Wild" database [3].

We further refined it, attaining an 11.9% word error rate. We also demonstrated promising results on low-shot learning, i.e. on words with few training examples [4].
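The word error rates reported above are the standard ASR metric: the word-level Levenshtein distance between the recognised and reference transcripts, divided by the reference length. A minimal, self-contained sketch of that computation (not the project's code, just the standard definition):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` is 0.5: one substitution plus one deletion over four reference words.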

We worked on visual keyword spotting. Our system obtained very promising KWS results for keywords unseen during training [7].

-Audiovisual Speech Recognition

We collaborated with Imperial College London in developing (a) the first truly end-to-end audiovisual word recognition system [5], and (b) a hybrid CTC/attention architecture for audio-visual ASR [9].
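In hybrid CTC/attention training, the network is optimised with a weighted combination of the CTC loss and the attention-based decoder loss; the same interpolation idea is also used when combining scores at decoding time. A toy sketch of that objective, assuming per-batch loss values are already computed (the weight `alpha` is a hypothetical hyperparameter, not a value from the project):

```python
def hybrid_ctc_attention_loss(ctc_loss: float,
                              attention_loss: float,
                              alpha: float = 0.5) -> float:
    """Hybrid multi-task objective:
        L = alpha * L_CTC + (1 - alpha) * L_attention,  0 <= alpha <= 1.
    alpha = 1 recovers pure CTC training; alpha = 0 pure attention."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ctc_loss + (1.0 - alpha) * attention_loss
```

The CTC branch encourages monotonic input-output alignment, which stabilises the attention decoder early in training; `alpha` trades off the two behaviours.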

We performed an exhaustive experimentation on the challenging "Lipreading in the Wild" database [8].

-Audio-only Speech and Speaker Recognition

We participated with the I4U consortium in the prestigious NIST Speaker Recognition Evaluation (SRE) of 2016 [1] [2].

We contributed to the development of a generic method called meta-embeddings, which improves the i-vector/PLDA model by 20% [6].

We helped improve a method for text-dependent and text-prompted speaker recognition, which goes beyond the state-of-the-art on the challenging RSR-2015 Part III benchmark [10].


We organised a one-day workshop at BMVC 2017, which took place at Imperial College London (visit:


[1] KA Lee, H Sun, S Aleksandr, W Guangsen, T Stafylakis, G Tzimiropoulos, et al. “The I4U submission to the 2016 NIST speaker recognition evaluation”, NIST SRE 2016 Workshop, 2016.

[2] KA Lee, V Hautamäki, T Kinnunen, A Larcher, C Zhang, A Nautsch, T Stafylakis, G Tzimiropoulos, et al. “The I4U mega fusion and collaboration for NIST speaker recognition evaluation 2016”, ISCA Interspeech 2017.

[3] T Stafylakis and G Tzimiropoulos, “Combining Residual Networks with LSTMs for Lipreading”, ISCA Interspeech 2017.

[4] T Stafylakis and G Tzimiropoulos, “Deep word embeddings for visual speech recognition”, IEEE ICASSP 2018.

[5] S Petridis, T Stafylakis, P Ma, F Cai, G Tzimiropoulos, M Pantic, “End-to-end Audiovisual Speech Recognition”, IEEE ICASSP 2018.

[6] N Brummer, A Silnova, L Burget, T Stafylakis, “Gaussian meta-embeddings for efficient scoring of a heavy-tailed PLDA model”, ISCA Odyssey 2018.

[7] T Stafylakis and G Tzimiropoulos, “Zero-shot keyword spotting for visual speech recognition in-the-wild”, ECCV 2018 (accepted).

[8] T Stafylakis, MH Khan and G Tzimiropoulos, “Pushing the boundaries of audiovisual word recognition using Spatiotemporal Residual Networks and LSTMs”, Computer Vision and Image Understanding, Elsevier (Journal Publication, current status: “minor revisions”).

[9] S Petridis, T Stafylakis, P Ma, G Tzimiropoulos, M Pantic, “Audio-visual speech recognition with a hybrid CTC/attention architecture”, IEEE Workshop on Spoken Language Technology (SLT), 2018 (current status: “under review”).

[10] N Maghsoodi, H Sameti, H Zeinali, T Stafylakis, “Speaker Recognition with Random Digit Strings Using Uncertainty Normalized HMM-based i-vectors”, IEEE/ACM Transactions on Audio, Speech and Language Processing (Journal Publication, current status: “under review”).

(1) We were the first to propose the use of Residual Networks for lipreading, showing exceptional results on the challenging "Lipreading in the Wild" database. Apart from yielding the best results published so far on LRW (and by a large margin), our approach has been used for several other applications, such as keyword spotting in silent movies, speech separation and enhancement, and visual-only Large Vocabulary Continuous Speech Recognition, by top-level universities and companies (including the VGG group at the University of Oxford, Google DeepMind and the i-bug group at Imperial College London).

(2) Our AV works are amongst the first to clearly demonstrate the gains attained by including visual information in ASR. While most works manage to attain notable gains only when noise is added to the audio component, we demonstrated significant gains even without such additive noise.

(3) We proposed the first visual-only Query-by-Text Keyword Spotting (KWS) architecture. We are the first (a) to deploy a Grapheme-to-Phoneme model for representing text queries, and (b) to show how localisation of queries can be performed even without transcripts that are aligned with the video during training. Finally, our experiments on the most challenging in-the-wild database show drastic improvements over the current state-of-the-art.
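The idea behind representing text queries with phonemes can be illustrated with a toy sketch: map the query to a phoneme sequence and search for it in a decoded phoneme stream, which both scores and localises the keyword. Everything here is illustrative; the `LEXICON` is a hypothetical stand-in for a learned Grapheme-to-Phoneme model, and real systems score soft phoneme posteriors rather than exact matches:

```python
# Hypothetical G2P lookup; an actual system learns grapheme-to-phoneme
# conversion so that unseen (zero-shot) keywords can also be represented.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def query_to_phonemes(word: str) -> list[str]:
    """Convert a text query into its phoneme sequence."""
    return LEXICON[word.lower()]

def keyword_score(query: str, decoded: list[str]) -> tuple[float, int]:
    """Score a text query against a decoded phoneme stream.

    Returns (score, start_index): score 1.0 if the query's phoneme
    sequence occurs contiguously in the stream (with its position,
    giving localisation), else (0.0, -1).
    """
    q = query_to_phonemes(query)
    for start in range(len(decoded) - len(q) + 1):
        if decoded[start:start + len(q)] == q:
            return 1.0, start
    return 0.0, -1
```

Because the query is represented by phonemes rather than by whole-word classes, keywords never seen during training can still be spotted, which is the essence of the zero-shot setting.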

(4) Our works on audio-only speech and speaker recognition improved over the current state-of-the-art in text-independent, text-dependent and text-prompted speaker recognition.


(1) Audiovisual ASR applications on mobiles: Voice assistant systems in mobile phones can include lip reading in order to improve their accuracy, especially in noisy environments or when only silent video is available.

(2) Audiovisual ASR applications on multimedia: Lip reading can be used in order to perform audiovisual ASR, speech enhancement and speaker separation.

(3) Medical applications: Lip reading can be used for patients with disorders of the vocal cords (dysphonia) or with aphonia, i.e. the complete inability to produce voice, e.g. as a result of thyroid cancer, laryngectomy or tracheostomy.

(4) Biometric applications: Lip reading and audiovisual ASR can be used to improve biometric authentication, making it less vulnerable to spoofing attacks than audio-only approaches.

(5) Surveillance and forensic applications: Lip reading can be used by law enforcement agencies or by courts in order to examine visual or audiovisual recordings.