SUMMA has advanced the state of the art in several dimensions:
1. Data Collection and Management.
SUMMA provides both a mechanism for supplying broadcast-media training data in much greater volumes than previous efforts and live-stream test data at scale. The scale of this data underpins the progress made in the technology areas described below.
2. Large-Scale Machine Learning and Prediction.
SUMMA developed: (1) new combinatorial formulations of structured prediction problems that are amenable to fast approximate decoding algorithms; (2) statistical and neural learning approaches for language and speech data; (3) fast and scalable online clustering algorithms that group related news articles in order to detect and track storylines; (4) approaches that exploit large pools of unlabelled data for which some form of weak supervision is possible; (5) lightly supervised approaches for developing speech recognition models for new languages and dialects.
3. Speech Recognition.
SUMMA has improved core acoustic and language modelling, developed effective adaptive approaches, and ported systems to new languages with differing amounts of available training data.
4. Machine Translation.
SUMMA delivered high-quality, scalable, adaptable machine translation systems over several language pairs, developing and evaluating novel neural machine translation approaches.
5. Segmentation, Clustering and Topic Detection.
SUMMA developed a unified multilingual framework to group incoming news articles into tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve.
6. Natural Language Understanding.
SUMMA: (1) developed statistical and neural semantic parsers that go beyond sentences to operate at storyline level; (2) generated story highlights by taking the output of the semantic parser and synthesising a coherent summary of the events that occur in the story; (3) performed sentiment analysis at storyline level.
7. Automated Knowledge Base Construction.
SUMMA: (1) developed multilingual entity recognition, linking, and coreference resolution; (2) carried out relation extraction across multiple languages; (3) extended knowledge bases to new relations; (4) developed a new activity on automated fact checking.
8. Scalable Platform.
SUMMA developed a flexible, modular platform that integrates the component technologies with a focus on the project use cases.
The impact of the project arises from the development of an efficient, scalable, multilingual platform that has a flexible, modular construction and includes components with state-of-the-art accuracy. These four dimensions open the door to the use of deeper linguistic processing coupled with media stream processing in the above areas, which up to now have relied, to a large extent, on manual labour or on extremely shallow automatic techniques. To assess and validate progress in these four directions, rigorous experiments will be carried out in the context of the use cases.