
Scalable Understanding of Multilingual Media

Periodic Reporting for period 2 - SUMMA (Scalable Understanding of Multilingual Media)

Reporting period: 2017-08-01 to 2019-01-31

Media monitoring enables the global news media to be viewed in terms of emerging trends, people in the news, and the evolution of story-lines. The massive growth in the number of broadcast and Internet media channels means that current approaches can no longer cope with the scale of the problem.

The aim of SUMMA is to significantly improve media monitoring by creating a platform to automate the analysis of media streams across many languages, to aggregate and distil the content, to automatically create rich knowledge bases, and to provide visualisations to cope with this deluge of data.

SUMMA has six objectives: (1) Development of a scalable and extensible media monitoring platform; (2) Development of high-quality and richer tools for analysts and journalists; (3) Extensible natural language processing analysis and automated knowledge base construction; (4) Multilingual and cross-lingual capabilities for streaming audio, video, and text; (5) Sustainable, maintainable platform and services; (6) Dissemination and communication of project results to stakeholders and the project user group.

Achieving these aims requires advancing the state of the art in a number of technologies: multilingual stream processing including speech recognition, machine translation, and story identification; entity and relation extraction; natural language understanding including deep semantic parsing, summarisation, and sentiment detection; and rich visualisations based on multiple views and dealing with many data streams.

The project focuses on three use cases: (1) External media monitoring - intelligent tools to address the dramatically increased scale of the global news monitoring problem; (2) Internal media monitoring - managing content creation in several languages efficiently by ensuring content created in one language is reusable by all other languages; (3) Data-driven media analysis.

The outputs of the project were field-tested at partners BBC and DW, and the platform was further validated through innovation intensives such as the BBC NewsHack.
The principal achievements of the project have been the design, implementation, and evaluation of the SUMMA Platform, a scalable multilingual media monitoring platform that combines "shallow" media stream processing with "deep" natural language processing (NLP). The main components of the SUMMA platform are: (1) multilingual stream processing, including speech recognition, machine translation, and story identification; (2) knowledge base construction and fact-checking, based on entity and relation extraction; (3) natural language understanding, including deep semantic parsing, summarisation, and sentiment detection; (4) visualisations, based on multiple views (e.g. topic, person, or timeline); and (5) a backend database to aggregate and process the monitored content.

The platform is currently focused on three use cases: (1) External media monitoring, exemplified by the operations of BBC Monitoring; (2) Internal media monitoring – the efficient management of multilingual content creation within an organisation, exemplified by the operations of Deutsche Welle; and (3) Data-driven media analysis based on the analysis of media content as data and including geolocation, fake news and fact checking, and integration with commonly used tools such as Slack.

The work performed during the project and the main results achieved are summarised by the following completed milestones: Data dump architecture operational (MS1); Requirements analysis for use cases (MS2 and MS11); Live streams operational (MS3); Architectures and APIs of SUMMA platform agreed (MS4); Demonstration and initial evaluation of stream processing technologies (MS5); Demonstration and initial evaluation of multilingual entity recognition, linking, coreference (MS6); Demonstration and initial evaluation of semantic role labelling (MS7); Development of prototypes for use cases (MS8, MS14, and MS15); Internal release of SUMMA platform (MS9); Scalability tests of the SUMMA platform (MS12 and MS17); Release of component tools (MS13 and MS18); Two News Hack events using the SUMMA platform (MS10 and MS16); Establishment of a project user group and three user group dissemination events (MS20); Final use case demonstrators (MS19); and the construction of a project sustainability plan (MS21).

SUMMA has progressed the state of the art in several dimensions:

1. Data Collection and Management.
SUMMA provides both a mechanism to supply training data for broadcast media in much greater volumes than previous efforts and live-stream test data at scale. The scale of this data provides a basis for the progress made in the technology areas described below.

2. Large-Scale Machine Learning and Prediction.
SUMMA developed: (1) new combinatorial formulations of structured prediction problems that are amenable to fast approximate decoding algorithms; (2) statistical and neural learning approaches for language and speech data; (3) fast and scalable online clustering algorithms for detecting and tracking story lines that cluster together related news articles; (4) approaches using a large pool of unlabeled data for which some sort of weak supervision is possible; (5) lightly supervised approaches to develop speech recognition models for new languages and dialects.
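Point (4) above — exploiting a large pool of unlabelled data via weak supervision — can be illustrated with a toy sketch: several noisy heuristic "labelling functions" each vote on an unlabelled headline, and the majority vote becomes a weak training label. The function names, keywords, and labels below are illustrative assumptions, not components of SUMMA.

```python
# Toy weak-supervision sketch: noisy heuristic "labelling functions"
# vote on unlabelled headlines; the majority label becomes a weak
# training label. Keywords and labels are illustrative only.
from collections import Counter

def lf_sport(text):
    return "sport" if any(w in text.lower() for w in ("match", "goal", "league")) else None

def lf_politics(text):
    return "politics" if any(w in text.lower() for w in ("election", "parliament", "minister")) else None

def lf_economy(text):
    return "economy" if any(w in text.lower() for w in ("market", "inflation", "trade")) else None

LABELLING_FUNCTIONS = [lf_sport, lf_politics, lf_economy]

def weak_label(text):
    """Majority label proposed by the labelling functions, or None if none fire."""
    votes = Counter(l for l in (lf(text) for lf in LABELLING_FUNCTIONS) if l is not None)
    return votes.most_common(1)[0][0] if votes else None

# Headlines with a clear weak label are kept as training data.
weakly_labelled = [
    (h, weak_label(h))
    for h in [
        "Minister survives parliament vote after election row",
        "Late goal decides league match",
        "Markets steady as inflation eases",
    ]
    if weak_label(h) is not None
]
```

In a realistic setting the labelling functions would be far noisier and would disagree, so a learned model of their accuracies would replace the simple majority vote.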

3. Speech Recognition.
SUMMA has improved core acoustic and language modelling, developed effective adaptive approaches, and ported systems to new languages with differing levels of training data resources.

4. Machine Translation.
SUMMA delivered high-quality, scalable, adaptable machine translation systems over several language pairs, developing and evaluating novel neural machine translation approaches.

5. Segmentation, Clustering and Topic Detection.
SUMMA developed a unified multilingual framework to group incoming news articles into tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve.
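The single-pass ("online") clustering idea behind storyline detection can be sketched as follows: each incoming article joins the most similar existing storyline, or starts a new one if nothing is similar enough. Bag-of-words cosine similarity stands in here for the richer multilingual representations used in the project; the class and threshold are illustrative assumptions.

```python
# Minimal online clustering sketch for storyline detection: an article
# joins the most similar storyline if similarity exceeds a threshold,
# otherwise it seeds a new storyline.
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a Counter of lowercased tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

class StorylineClusterer:
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.storylines = []  # each: {"centroid": Counter, "articles": [str]}

    def add(self, article):
        vec = bow(article)
        best, best_sim = None, 0.0
        for story in self.storylines:
            sim = cosine(vec, story["centroid"])
            if sim > best_sim:
                best, best_sim = story, sim
        if best is not None and best_sim >= self.threshold:
            best["articles"].append(article)
            best["centroid"].update(vec)  # centroid approximated by summed counts
        else:
            self.storylines.append({"centroid": vec, "articles": [article]})

clusterer = StorylineClusterer(threshold=0.3)
for article in [
    "earthquake strikes coastal city overnight",
    "rescue teams search rubble after earthquake in coastal city",
    "central bank raises interest rates",
]:
    clusterer.add(article)
```

The first two articles share enough vocabulary to form one storyline, while the third starts a second one; a production system would additionally decay or prune old storylines as the stream evolves.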

6. Natural Language Understanding.
SUMMA: (1) developed statistical and neural semantic parsers that go beyond sentences to operate at storyline level; (2) generated story highlights by taking the output of the semantic parser, and synthesising a coherent summary of events that occur in the story; (3) performed sentiment analysis for a storyline.
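For intuition, point (2) — generating story highlights — can be approximated by a toy extractive summariser. SUMMA synthesised highlights from semantic-parser output; the sketch below instead scores each sentence by the average corpus frequency of its words and keeps the top-ranked sentences, a deliberately simple stand-in.

```python
# Toy extractive summariser: rank sentences by average word frequency
# within the story and keep the top n, preserving original order.
# A stand-in for SUMMA's parser-based highlight generation.
import re
from collections import Counter

def summarise(text, n_sentences=1):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)

story = (
    "An earthquake struck the coastal city overnight. "
    "Rescue teams searched the rubble of the coastal city. "
    "The weather was mild."
)
highlight = summarise(story)
```

Frequency-based scoring naturally favours sentences about the story's dominant entities, which is the same intuition the project's storyline-level models exploit with far richer semantic features.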

7. Automated Knowledge Base Construction.
SUMMA: (1) developed multilingual entity recognition, linking, and coreference resolution; (2) carried out relation extraction across multiple languages; (3) extended knowledge bases to new relations; (4) developed a new activity on automated fact checking.
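A much-simplified picture of points (2) and (3) — extracting relations and adding them to a knowledge base — is pattern matching over text to produce (subject, relation, object) triples. SUMMA's extractors are multilingual and learned; the regex patterns and relation names below are illustrative assumptions only.

```python
# Illustrative pattern-based relation extraction feeding a tiny
# in-memory knowledge base of (subject, relation, object) triples.
import re

PATTERNS = [
    (re.compile(r"(?P<subj>[A-Z][\w ]+?) is the capital of (?P<obj>[A-Z][\w ]+)"), "capital_of"),
    (re.compile(r"(?P<subj>[A-Z][\w ]+?) was founded in (?P<obj>\d{4})"), "founded_in"),
]

def extract_triples(sentence):
    """Return all (subject, relation, object) triples matched in a sentence."""
    triples = []
    for pattern, relation in PATTERNS:
        for m in pattern.finditer(sentence):
            triples.append((m.group("subj").strip(), relation, m.group("obj").strip()))
    return triples

kb = set()
for sent in ["Paris is the capital of France.", "Reuters was founded in 1851."]:
    kb.update(extract_triples(sent))
```

A learned extractor replaces the hand-written patterns with a classifier over entity pairs, and entity linking maps surface strings like "Paris" to canonical knowledge-base identifiers before the triple is stored.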

8. Scalable Platform.
SUMMA developed a flexible, modular platform that integrates the component technologies with a focus on the project use cases.
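The modular-platform idea can be sketched as a pipeline of independently registered processing stages applied in sequence to each incoming document. The real platform runs its components as separate services around a message queue rather than in-process; the class and stage names below are illustrative assumptions.

```python
# Minimal sketch of a modular document-processing pipeline: components
# are registered by name and chained over each incoming document.
class Pipeline:
    def __init__(self):
        self.stages = []  # list of (name, function) pairs, applied in order

    def register(self, name, fn):
        self.stages.append((name, fn))
        return self  # allow chained registration

    def process(self, doc):
        for _name, fn in self.stages:
            doc = fn(doc)
        return doc

pipeline = (
    Pipeline()
    .register("normalise", lambda doc: {**doc, "text": doc["text"].strip().lower()})
    .register("tokenise", lambda doc: {**doc, "tokens": doc["text"].split()})
    .register("count", lambda doc: {**doc, "n_tokens": len(doc["tokens"])})
)
result = pipeline.process({"text": "  Breaking News from the newsroom  "})
```

Keeping each stage behind a uniform document-in, document-out interface is what lets components such as speech recognition or translation be added, swapped, or scaled independently.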

The impact of the project arises from the development of an efficient, scalable, multilingual platform that has a flexible, modular construction and includes components with state-of-the-art accuracy. These four dimensions open the door to coupling deeper linguistic processing with media stream processing in the above areas, which up to now have relied to a large extent on manual labour or on extremely shallow automatic techniques. To assess and validate progress in these four directions, rigorous experiments will be carried out in the context of the use cases.
Technical overview of the SUMMA project