Community Research and Development Information Service - CORDIS

ERC

IHEARU Report Summary

Project ID: 338164
Funded under: FP7-IDEAS-ERC
Country: Germany

Mid-Term Report Summary - IHEARU (Intelligent systems' Holistic Evolving Analysis of Real-life Universal speaker characteristics)

A major aim of the iHEARu project is to provide the knowledge and technology required for a holistic understanding of all the paralinguistic facets of human speech in tomorrow's real-life information, communication and entertainment systems. In this regard, research in this period has focused on multi-task learning, active learning, semi-supervised learning and cooperative learning, as well as the construction of evolving and self-learning systems. Research on multi-task learning has concentrated on novel deep learning techniques for the joint prediction of multiple paralinguistic attributes. Research has also been undertaken into using attributes generated by well-resourced learning systems (e.g., for emotion recognition) as features for related learning tasks for which labelled data is scarce (e.g., detecting deceptive speech).

In meeting the goals of the project, large-scale speech and metadata mining from public sources (e.g., social media), combined with semi-automatic annotation methods (e.g., active learning), is essential for building large, realistic, richly annotated and transcribed data sets. To exploit the information in large-scale unlabelled data, a variety of active learning approaches, which automatically identify salient speech samples for human annotation, have been proposed. Other techniques for leveraging unlabelled data, such as semi-supervised learning, which requires almost no human input in identifying salient speech samples, have also been explored. Further, we investigated cooperative learning approaches, in which salient samples identified with high confidence values are labelled automatically and those with lower confidence values are labelled by humans.
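The routing step of cooperative learning described above can be illustrated with a minimal sketch. This is not the project's implementation: the function name `cooperative_split`, the 0.9 threshold and the example labels are all assumptions chosen for illustration, and the classifier that produces the predictions and confidence values is taken as given.

```python
from typing import List, Sequence, Tuple


def cooperative_split(
    confidences: Sequence[float],
    predictions: Sequence[str],
    threshold: float = 0.9,  # assumed cut-off, not a value from the report
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Route samples either to automatic labelling or to human annotators.

    Returns (auto_labelled, needs_human), where auto_labelled holds
    (sample index, predicted label) pairs and needs_human holds the
    indices of samples to be passed to human raters.
    """
    if len(confidences) != len(predictions):
        raise ValueError("confidences and predictions must have equal length")

    auto_labelled: List[Tuple[int, str]] = []
    needs_human: List[int] = []
    for idx, (conf, pred) in enumerate(zip(confidences, predictions)):
        if conf >= threshold:
            # High confidence: accept the machine label as-is.
            auto_labelled.append((idx, pred))
        else:
            # Lower confidence: defer to a human annotator.
            needs_human.append(idx)
    return auto_labelled, needs_human


# Hypothetical classifier outputs for four speech samples.
auto, human = cooperative_split(
    confidences=[0.97, 0.55, 0.91, 0.40],
    predictions=["happy", "sad", "angry", "neutral"],
)
print(auto)   # samples 0 and 2 keep their machine labels
print(human)  # samples 1 and 3 go to human raters
```

In practice the threshold would trade annotation cost against label noise: a lower value sends fewer samples to humans but admits more machine-labelling errors into the training set.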

A key proposal of the iHEARu project was to move beyond pure simulation experiments and pioneer self-learning methods on large-scale data within computational paralinguistics. Development of a self-learning system has begun, with the aim of identifying salient audio and video clips in media archives and on video-sharing websites such as YouTube. To automatically build large-scale, richly annotated datasets, the system will make use of user-specified tags and comments; clip associations and recommendations within a particular website; and the results of unsupervised acoustic or visual content analysis, clustering and activity detection.

In terms of the real-life aspect of the project in particular, significant gains have been made using signal processing techniques for acoustic feature generation and enhancement. These techniques have been used to successfully increase system robustness to unwanted variability and mismatch between the training and recognition (i.e., actual usage) conditions/domains. Novel techniques proposed include: unsupervised signal noise reduction that can be easily reconfigured for changing acoustic conditions; end-to-end learning in order to directly predict the desired recognition target from the raw signal; robust acoustic features with good generalisation abilities; data augmentation and enrichment to better handle scarcity of labelled high-quality training material; deep neural networks for automatic feature learning; and feature enhancement by means of transfer learning and cross-domain feature mapping.

A key aspect of the universal analysis proposed by iHEARu is keeping humans constantly in the loop, both to help monitor the success of our automatic labelling methods and to run novel listening experiments that give insight into useful acoustic features and human reasoning in challenging listening conditions. In this regard, we have developed the web-based multiplayer game iHEARu-PLAY for crowdsourced database collection and labelling. Further, to estimate rater-dependent consistency in crowdsourced ratings, a novel evaluator, termed the Weighted Trustability Evaluator (WTE), has been proposed.

Contact

Sabine Wiendl (Head of Department, LEAR)
Tel.: +498515091110
Fax: +498515091002
Record Number: 189655 / Last updated on: 2016-10-13