

Final Report Summary - IHEARU (Intelligent systems' Holistic Evolving Analysis of Real-life Universal speaker characteristics)

A major aim of iHEARu was to provide knowledge and technology enabling a holistic understanding of all the paralinguistic facets of human speech in tomorrow's information, communication, and entertainment systems. Towards this aim, research focused on multi-task learning, active learning, semi-supervised learning, unsupervised learning, cooperative learning, and the construction of evolving and self-learning systems. Research on multi-task learning concentrated on novel deep learning techniques for the joint prediction of multiple speaker states and traits. Other research used attributes generated by well-resourced learning systems (e.g. emotion recognition systems) as features for related learning tasks for which labelled data is scarce (e.g. detecting deceptive speech).
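The joint-prediction idea behind this multi-task learning can be sketched in a few lines: a shared hidden layer feeds one output head per speaker attribute, and the per-task losses are summed during training, so the shared representation must serve all tasks at once. Everything below is illustrative only: the data are synthetic, the two binary tasks are hypothetical stand-ins for speaker states, and the layer sizes are not the project's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 utterance-level feature vectors with two correlated binary
# labels standing in for two hypothetical speaker-attribute tasks.
X = rng.normal(size=(200, 10))
Y = (X @ rng.normal(size=(10, 2)) + 0.1 * rng.normal(size=(200, 2)) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A shared hidden layer feeds one sigmoid output head per task.
W1 = 0.1 * rng.normal(size=(10, 8))   # shared representation
W2 = 0.1 * rng.normal(size=(8, 2))    # one column per task head

lr = 0.5
for _ in range(500):
    H = np.tanh(X @ W1)               # shared features for both tasks
    P = sigmoid(H @ W2)               # per-task probabilities
    G = (P - Y) / len(X)              # gradient of the summed log-losses
    dH = (G @ W2.T) * (1.0 - H**2)    # backprop through the shared layer
    W2 -= lr * H.T @ G
    W1 -= lr * X.T @ dH

# Per-task training accuracy after joint training.
acc = ((sigmoid(np.tanh(X @ W1) @ W2) > 0.5) == Y).mean(axis=0)
```

Because the gradient of the summed losses flows back through the same `W1`, features useful to one task are available to the other, which is the mechanism that lets a well-resourced task support a data-scarce one.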
Towards the goal of technology for everyday use, significant gains were made using signal processing techniques for acoustic feature generation and enhancement. These techniques successfully increased system robustness to unwanted variability and mismatch between the training and recognition (i.e. actual usage) conditions/domains. Novel techniques developed include unsupervised signal noise reduction that is easily reconfigurable in changing acoustic conditions, end-to-end learning to directly predict the desired recognition target from the raw signal, data augmentation and enrichment to better handle scarcity of labelled high-quality training material, deep neural networks for automatic feature learning, and feature enhancement by means of transfer learning and cross-domain feature mapping. Robust acoustic features with good generalisability were also developed.
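One classic example of the unsupervised, easily reconfigurable noise reduction mentioned above is spectral subtraction: the noise spectrum is estimated from the signal itself (here, from leading noise-only frames) and can simply be re-estimated when the acoustic conditions change. This is a generic textbook sketch on synthetic data, not the project's specific method; the tone, noise level, and frame parameters are all arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "speech": a 440 Hz tone buried in stationary noise, with a
# leading noise-only segment (first 2000 samples) before the tone starts.
sr = 8000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.5 * rng.normal(size=sr)
noisy = np.concatenate([noise[:2000], clean[2000:] + noise[2000:]])

frame, hop = 256, 128
window = np.hanning(frame)

def stft_frames(x):
    idx = range(0, len(x) - frame, hop)
    return np.array([np.fft.rfft(window * x[i:i + frame]) for i in idx])

S = stft_frames(noisy)
# Unsupervised noise estimate: average magnitude over the first frames,
# assumed noise-only; re-estimating this online is what makes the scheme
# reconfigurable in changing acoustic conditions.
noise_mag = np.abs(S[:10]).mean(axis=0)

# Subtract the noise magnitude, keeping a small spectral floor,
# and reuse the noisy phase for resynthesis.
mag = np.maximum(np.abs(S) - noise_mag, 0.05 * np.abs(S))
S_clean = mag * np.exp(1j * np.angle(S))

# Overlap-add resynthesis of the enhanced signal.
out = np.zeros(len(noisy))
for k, spec in enumerate(S_clean):
    out[k * hop:k * hop + frame] += window * np.fft.irfft(spec, frame)
```

The noise-only region is strongly attenuated while the tone, which dominates its frequency bins, survives largely intact.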
Meeting project goals, several publicly available toolkits for feature generation were developed: openXBOW, the first open-source toolkit for the generation of crossmodal bag-of-words representations; DeepSpectrum, a Python toolkit that passes spectrograms through a pre-trained image convolutional neural network; and auDeep, a Python toolkit for deep unsupervised learning based on a recurrent sequence-to-sequence autoencoder. CAS2T, a user-specified self-optimising paralinguistic recognition system for the organisation of web-based multimedia sources, was used to investigate clustering of metadata, acoustic clustering, and voice activity/content detection.
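The bag-of-words representation that openXBOW produces can be illustrated in miniature: quantise frame-level acoustic features against a learned codebook, then describe each recording by a normalised histogram of codeword counts, giving a fixed-length vector regardless of recording length. The sketch below uses scikit-learn's k-means rather than openXBOW's own interface, and the "MFCC" frames are random stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Frame-level acoustic features (hypothetical: 13 coefficients per frame)
# for two "recordings" with different acoustic character.
rec_a = rng.normal(0.0, 1.0, size=(300, 13))
rec_b = rng.normal(2.0, 1.0, size=(250, 13))

# 1. Learn a codebook of 16 "audio words" over all frames.
codebook = KMeans(n_clusters=16, n_init=10, random_state=0)
codebook.fit(np.vstack([rec_a, rec_b]))

# 2. Represent each recording as a normalised histogram of codeword
#    counts: a fixed-length bag-of-words regardless of recording length.
def bag_of_words(frames):
    counts = np.bincount(codebook.predict(frames), minlength=16)
    return counts / counts.sum()

bow_a, bow_b = bag_of_words(rec_a), bag_of_words(rec_b)
```

The two recordings, though of different lengths, map to comparable 16-dimensional vectors that a downstream classifier can consume directly.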
Going beyond pure simulation experiments, iHEARu pioneered self-learning methods on large-scale data in computational paralinguistics. Large, realistic, richly annotated, and transcribed datasets, including iHEARu-EAT, L2, EmotAsS, EmoFilm, and DEMoS, were built using large-scale speech and metadata mining of public sources such as social media, together with crowdsourcing for labelling and quality control, and shared semi-automatic annotation. To exploit information from large-scale unlabelled datasets, a variety of active learning and confidence estimation approaches were developed that automatically identify salient speech samples for human annotation. Further, semi-supervised learning and dynamic active learning were combined into cooperative learning, a technique that significantly reduces the need for labour-intensive human annotation.
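The cooperative-learning loop described above can be sketched as follows: a model trained on a small labelled seed set scores the unlabelled pool; high-confidence samples are labelled by the model itself (semi-supervised learning), low-confidence samples are sent to a human annotator (active learning), and the model is retrained. The thresholds, data, and oracle below are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Synthetic pool of 500 "speech samples"; the true labels y stand in for
# the human annotator a real system would query.
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

labelled = list(range(20))                 # small human-labelled seed set
unlabelled = set(range(20, 500))
y_known = {i: int(y[i]) for i in labelled} # labels acquired so far
human_queries = 0

for _ in range(5):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[labelled], [y_known[i] for i in labelled])
    pool = sorted(unlabelled)
    conf = clf.predict_proba(X[pool]).max(axis=1)
    for i, c in zip(pool, conf):
        if c > 0.95:                       # semi-supervised: trust the model
            y_known[i] = int(clf.predict(X[i:i + 1])[0])
        elif c < 0.6:                      # active learning: ask the human
            y_known[i] = int(y[i])
            human_queries += 1
        else:
            continue                       # ambiguous: stay unlabelled for now
        labelled.append(i)
        unlabelled.discard(i)

accuracy = clf.score(X, y)
```

Most labels end up machine-generated, so only the genuinely uncertain samples cost human effort, which is the annotation saving the cooperative-learning combination is designed to deliver.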
Key to the universal analysis developed in iHEARu is the constant use of humans in the loop to help monitor the success of our automatic labelling methods and enable novel human perception studies that provide insights into useful acoustic features and human reasoning in challenging listening conditions. In this regard, we developed the web-based multiplayer game iHEARu-PLAY for crowdsourced database collection and labelling, and VoiLA, a novel free web-based speech classification tool.