Final Report Summary - IHEARU (Intelligent systems' Holistic Evolving Analysis of Real-life Universal speaker characteristics)
Towards the goal of technology for everyday use, significant gains were made using signal processing techniques for acoustic feature generation and enhancement. These techniques successfully increased system robustness to unwanted variability and to mismatch between the training and recognition (i.e., actual usage) conditions and domains. Novel techniques developed include unsupervised signal noise reduction that is easily reconfigurable under changing acoustic conditions; end-to-end learning that predicts the desired recognition target directly from the raw signal; data augmentation and enrichment to mitigate the scarcity of labelled high-quality training material; deep neural networks for automatic feature learning; and feature enhancement by means of transfer learning and cross-domain feature mapping. Robust acoustic features with good generalisability were also developed.
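To illustrate the kind of data-augmentation step referred to above, the following sketch mixes noise into a clean signal at a chosen signal-to-noise ratio, producing extra "noisy" training copies of scarce labelled material. This is a minimal illustration in plain NumPy under our own assumptions, not code from the project toolkits; the function name and parameters are hypothetical.

```python
import numpy as np

def augment_with_noise(signal, noise, snr_db):
    """Mix noise into a clean signal at a target signal-to-noise ratio (dB).

    The noise is scaled so that the mixture has the requested SNR,
    yielding an additional training copy under simulated acoustic conditions.
    """
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so that sig_power / (scale**2 * noise_power) == 10**(snr_db / 10)
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
noise = rng.standard_normal(16000)
noisy = augment_with_noise(clean, noise, snr_db=10.0)
```

In practice such mixtures would be generated at several SNRs and with recorded environmental noise, so the trained model sees a wider range of conditions than the original corpus contains.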
In line with the project goals, several publicly available toolkits for feature generation were developed: openXBOW, the first open-source toolkit for generating crossmodal bag-of-words representations; DeepSpectrum, a Python toolkit that passes spectrograms through a pre-trained image convolutional neural network; and auDeep, a Python toolkit for deep unsupervised learning based on a recurrent sequence-to-sequence autoencoder. CAS2T, a user-specified self-optimising paralinguistic recognition system for the organisation of web-based multimedia sources, was used to investigate clustering of metadata, acoustic clustering, and voice activity/content detection.
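As a sketch of the bag-of-words idea behind openXBOW (purely illustrative, not the toolkit's actual API): per-frame acoustic descriptors are quantised against a codebook of "audio words", and the assignment counts form one fixed-length histogram per utterance, regardless of its duration. The random codebook below stands in for one that would, in practice, be learned, e.g. by clustering training frames.

```python
import numpy as np

def bag_of_words(frames, codebook):
    """Quantise per-frame feature vectors against a codebook and count
    assignments, giving one fixed-length histogram per utterance.

    frames:   (n_frames, n_dims) low-level descriptors, e.g. MFCCs
    codebook: (n_words, n_dims) cluster centres ("audio words")
    """
    # Distance from every frame to every codebook word
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    assignments = d.argmin(axis=1)          # nearest word per frame
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                # term-frequency normalisation

rng = np.random.default_rng(1)
utterance = rng.standard_normal((200, 13))  # e.g. 200 frames of 13 MFCCs
codebook = rng.standard_normal((50, 13))    # 50 "audio words"
bow = bag_of_words(utterance, codebook)     # one 50-dim vector per utterance
```

The fixed-length histogram is what makes variable-length audio (or text, in the crossmodal case) usable by standard static classifiers.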
Going beyond pure simulation experiments, iHEARu pioneered self-learning methods on large-scale data in computational paralinguistics. Large, realistic, richly annotated, and transcribed data sets, including iHEARu-EAT, L2, EmotAsS, EmoFilm, and DEMoS, were built using large-scale speech and metadata mining of public sources such as social media, together with crowdsourcing for labelling and quality control, and shared semi-automatic annotation. To exploit information from large-scale unlabelled datasets, we developed a variety of active-learning and confidence-estimation approaches that automatically identify salient speech samples for human annotation. Further, semi-supervised learning and dynamic active learning were combined into cooperative learning, a technique that significantly reduces the need for labour-intensive human annotation.
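The cooperative-learning loop described above can be sketched as follows: samples the model labels with high confidence are added automatically (the semi-supervised step), the least confident samples are sent to a human annotator (the active-learning step), and the model is retrained on the grown labelled set. The nearest-centroid model, the margin-based confidence, and all thresholds below are our own illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Fit a toy classifier: one centroid per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict_with_confidence(X, classes, centroids):
    """Predict the nearest class; confidence is the margin between
    the two closest centroids (larger margin = more confident)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[d.argmin(axis=1)]
    d_sorted = np.sort(d, axis=1)
    return pred, d_sorted[:, 1] - d_sorted[:, 0]

def cooperative_learning(X_lab, y_lab, X_unlab, oracle, hi=1.0, lo=0.2, rounds=5):
    """Alternate machine labelling of confident samples with human
    annotation of uncertain ones, retraining each round."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        classes, centroids = nearest_centroid_fit(X_lab, y_lab)
        pred, conf = predict_with_confidence(X_unlab, classes, centroids)
        take_auto = conf >= hi       # trust the machine label
        take_human = conf <= lo      # worth a human annotation
        new_y = pred.copy()
        new_y[take_human] = oracle(X_unlab[take_human])
        move = take_auto | take_human
        X_lab = np.vstack([X_lab, X_unlab[move]])
        y_lab = np.concatenate([y_lab, new_y[move]])
        X_unlab = X_unlab[~move]
    return X_lab, y_lab, X_unlab

# Tiny demo on two synthetic clusters; the oracle stands in for a human annotator.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
y_true = np.array([0] * 30 + [1] * 30)
oracle = lambda Q: (Q[:, 0] > 2.5).astype(int)
seed = [0, 1, 30, 31]                       # only four human-labelled samples
rest = list(range(2, 30)) + list(range(32, 60))
X_lab, y_lab, X_left = cooperative_learning(X[seed], y_true[seed], X[rest], oracle)
```

Samples whose confidence falls between the two thresholds stay unlabelled until a later round, which is where the saving in human effort comes from.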
Key to the universal analysis developed in iHEARu is keeping humans in the loop: they help monitor the success of our automatic labelling methods and enable novel human perception studies that yield insights into useful acoustic features and into human reasoning under challenging listening conditions. To this end, we developed iHEARu-PLAY, a web-based multiplayer game for crowdsourced database collection and labelling, and VoiLA, a novel free web-based speech classification tool.