Community Research and Development Information Service - CORDIS


SUMMA Report Summary

Project ID: 688139
Funded under: H2020-EU.2.1.1.

Periodic Reporting for period 1 - SUMMA (Scalable Understanding of Multilingual Media)

Reporting period: 2016-02-01 to 2017-07-31

Summary of the context and overall objectives of the project

Media monitoring enables the global news media to be viewed in terms of emerging trends, people in the news, and the evolution of story-lines. The massive growth in the number of broadcast and Internet media channels means that current approaches can no longer cope with the scale of the problem.

The aim of SUMMA is to significantly improve media monitoring by creating a platform to automate the analysis of media streams across many languages, to aggregate and distil the content, to automatically create rich knowledge bases, and to provide visualisations to cope with this deluge of data.

SUMMA has six objectives: (1) Development of a scalable and extensible media monitoring platform; (2) Development of high-quality and richer tools for analysts and journalists; (3) Extensible natural language processing analysis and automated knowledge base construction; (4) Multilingual and cross-lingual capabilities for streaming audio, video, and text; (5) Sustainable, maintainable platform and services; (6) Dissemination and communication of project results to stakeholders and the user group.

Achieving these aims will require advancing the state of the art in a number of technologies: multilingual stream processing including speech recognition, machine translation, and story identification; entity and relation extraction; natural language understanding including deep semantic parsing, summarisation, and sentiment detection; and rich visualisations based on multiple views and dealing with many data streams.

The project will focus on three use cases: (1) External media monitoring - intelligent tools to address the dramatically increased scale of the global news monitoring problem; (2) Internal media monitoring - managing content creation in several languages efficiently by ensuring content created in one language is reusable by all other languages; (3) Data journalism.

The outputs of the project will be field-tested at partners BBC and DW, and the platform will be further validated through innovation intensives such as the BBC NewsHack.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

The principal achievements of the project have been the design, implementation, and evaluation of the SUMMA Platform, a scalable multilingual media monitoring platform that combines "shallow" media stream processing with "deep" natural language processing (NLP). The main components of the SUMMA platform are: (1) multilingual stream processing, including speech recognition, machine translation, and story identification; (2) knowledge base construction, based on entity and relation extraction; (3) natural language understanding, including deep semantic parsing, summarisation, and sentiment detection; (4) visualisations, based on multiple views (e.g., topic, person, or timeline); and (5) a backend database to aggregate and process the monitored content.

The platform is currently focused on two use cases: (1) External media monitoring, exemplified by the operations of BBC Monitoring; and (2) Internal media monitoring – the efficient management of multilingual content creation within an organisation, exemplified by the operations of Deutsche Welle.

The work performed during the first period of the project and the main results achieved are summarised by the following completed milestones: Data dump architecture operational (MS1); Requirements analysis for internal and external media monitoring use cases (MS2); Live streams operational (MS3); Architectures and APIs of SUMMA platform agreed (MS4); Demonstration and initial evaluation of stream processing technologies (MS5); Demonstration and initial evaluation of multilingual entity recognition, linking, coreference (MS6); Demonstration and initial evaluation of semantic role labelling (MS7); Initial prototypes for internal and external media monitoring use cases (MS8); Internal release of SUMMA platform (MS9); and the establishment of a project user group and the first user group event.

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

SUMMA will progress the state of the art in several dimensions:

1. Data Collection and Management.
SUMMA will provide a mechanism to supply training data for broadcast media in much greater volumes than previous efforts, and will also provide test data at scale (hundreds of live streams). The scale of this data will provide a basis for the progress that we plan to make in the technology areas described below.

2. Large-Scale Machine Learning and Prediction.
SUMMA will develop: (1) new combinatorial formulations of structured prediction problems that are amenable to fast approximate decoding algorithms, such as message-passing and dual decomposition methods; (2) statistical learning approaches that lead to sparse, compact models, hence reducing the memory footprint and the number of features; (3) fast and scalable online clustering algorithms for detecting and tracking story lines that cluster together related news articles; (4) approaches using a large pool of unlabeled data for which some sort of weak supervision is possible; (5) lightly supervised approaches to develop speech recognition models for new languages and dialects; (6) adaptation algorithms for petascale language models.

3. Speech Recognition.
SUMMA will improve core acoustic and language modelling (and hence the accuracy of the deployed systems), develop effective adaptive approaches, and port systems to new languages, in the context of different levels of training data resource.

4. Machine Translation.
SUMMA will deliver high-quality, scalable, adaptable machine translation systems over several language pairs, developing and evaluating novel neural machine translation approaches.

5. Segmentation, Clustering and Topic Detection.
SUMMA will develop a unified multilingual framework to group incoming news articles into tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve.

6. Natural Language Understanding.
SUMMA will: (1) develop statistical semantic parsers that go beyond sentences to operate at storyline level; (2) generate story highlights by taking the output of the semantic parser, and synthesising a coherent summary of all events that occur in the story according to the semantic parser; (3) perform sentiment analysis for a storyline.

7. Automated Knowledge Base Construction.
SUMMA will: (1) develop multilingual entity recognition, linking, and coreference resolution; (2) carry out relation extraction across multiple languages; (3) extend knowledge bases to new relations; (4) develop a new activity on automated fact checking.

8. Scalable Platform.
SUMMA will develop a flexible, modular platform that integrates the component technologies with a focus on the project use cases.

The impact of the project arises from the development of an efficient, scalable, multilingual platform that has a flexible, modular construction and includes components with state-of-the-art accuracy. These four dimensions open the door for the use of deeper linguistic processing coupled with media stream processing in the above areas, which up to now have relied, to a large extent, on manual labour or on extremely shallow automatic techniques. To assess and validate progress in these four directions, rigorous experiments will be carried out in the context of the use cases.
