Periodic Reporting for period 2 - MeMAD (Methods for Managing Audiovisual Data: Combining Automatic Efficiency with Human Accuracy)

Reporting period: 2019-07-01 to 2021-03-31

Summary of the context and overall objectives of the project

This project has developed novel methods for analysing and describing video content based on a combination of computer vision techniques, human input and machine learning approaches. These descriptions allow the Creative Industries, as well as people using their services, to access, use and find audiovisual information in novel ways with better metadata. Users can locate particular segments in video rapidly and accurately by searching and browsing text corpora compiled from audiovisual data aligned with verbal descriptions. Moreover, the intermodal translation of images and sounds into words will attract new users, such as deaf, hard-of-hearing, blind and partially sighted audiences who would otherwise be excluded from the visual or auditory content.

The MeMAD consortium focused especially on TV broadcasting and on-demand media services. The four main project objectives were:

Objective O1: Develop novel methods and tools for digital storytelling
Objective O2: Deliver methods and tools to expand the size of media audiences
Objective O3: Develop an improved scientific understanding of multimodal and multilingual media content analysis, linking and consumption
Objective O4: Deliver object models and formal languages, distribution protocols and display tools for enriched audiovisual data

The results of MeMAD were well aligned with the action ICT-20-2017, developing tools for smart digital content for creative industries in the European broadcasting domain. The research results were world class, as demonstrated first by our success in various scientific benchmarking challenges and then by strong results on novel real-world tasks.
In addition to publishing scientific articles and sharing the software and results publicly, we moved the research field forward. By disseminating the results directly to various European broadcasters and their service suppliers, we worked to maximize our impact on the production and distribution of audiovisual content.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

MeMAD collected a substantial amount of broadcast video data and constructed and licensed datasets for third parties to use after the project has finished. The main MeMAD prototype was created on top of Limecraft's Flow system, where the new technology components can be operated and the results from one component and user environment can be seamlessly passed to another. We advanced the state of the art in the generation of descriptions of audiovisual data jointly by automatic visual analysis, speech recognition, audio event detection, speaker diarization and named entity recognition, as well as an innovative way of ingesting legacy metadata in the form of a knowledge graph. We contributed to the state of the art in multimodal machine translation, where the output description of multimodal events can be provided in multiple languages to improve cross-lingual search. We studied human annotation of video data and audio description and created a human-annotated video database for comparative analysis of human and machine descriptions. We contributed to existing semantic metadata standards and applied Linked Data best practices to publish a MeMAD knowledge graph that provides semantic descriptions of broadcast video data. Finally, we facilitated joint work between the media industry and researchers, increasing mutual understanding of typical professional workflows, priorities and user needs in both domains.
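The joint description pipeline outlined above combines time-stamped outputs from several analysis components. One such fusion step, attributing diarized speaker labels to speech recognition segments by temporal overlap, can be sketched as follows; the data shapes and function names are illustrative assumptions, not the actual MeMAD components.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float
    text: str      # ASR transcript for this time span

@dataclass
class SpeakerTurn:
    start: float
    end: float
    speaker: str   # diarization label, e.g. "SPK_1"

def overlap(a_start, a_end, b_start, b_end):
    """Length of the temporal overlap between two spans (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def attribute_speakers(segments, turns):
    """Assign each ASR segment the diarized speaker with the largest overlap."""
    out = []
    for seg in segments:
        best = max(turns, default=None,
                   key=lambda t: overlap(seg.start, seg.end, t.start, t.end))
        if best and overlap(seg.start, seg.end, best.start, best.end) > 0:
            speaker = best.speaker
        else:
            speaker = "UNKNOWN"
        out.append({"start": seg.start, "end": seg.end,
                    "speaker": speaker, "text": seg.text})
    return out
```

For example, a segment spanning 0.0-2.5 s that mostly overlaps a "SPK_1" turn is labelled "SPK_1" even if a second speaker's turn begins just before the segment ends. The same overlap-based alignment generalizes to visual labels and audio events, which is what makes time-stamped component outputs composable.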
This lays the groundwork for deeper future collaboration and ensures the relevance of the project work. Public project deliverables and a summary of project publications can be found at https://zenodo.org/communities/memad. Software and results created in the project are available at https://github.com/memad-project. We published a semantic data platform at http://data.memad.eu/. By disseminating the results directly to various European broadcasters and their service suppliers, we maximized our impact on the production and distribution of audiovisual content.

After the project, the results will be utilized in the business and services provided by the partner companies Limecraft, Lingsoft and LLS and the data providers YLE and INA. They will also be utilized in the planned standardization activities and were submitted to the EBU, private broadcasters and other agents in the media sector as recommended practices for the use cases we implemented. A significant part of the results was published open access to be exploited by anyone interested. In addition to software and scripts, MeMAD released open benchmark and evaluation datasets for automatic speech recognition, multimodal content analysis and machine translation in the media context.

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

In digital storytelling it is typical that scripts are not available in post-production and are not supported by the distribution protocol. MeMAD developed a platform to create and curate an electronic script throughout the production and post-production processes that is turned into subtitles, audio description and clickable on-screen captions. For non-scripted content, MeMAD constructs a script as the available material is being indexed and as material fragments are selected, either manually or automatically, as part of the story.
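The semantic data platform published at http://data.memad.eu/ follows Linked Data practices, so its knowledge graph can in principle be queried over SPARQL. The sketch below shows how such a query could be prepared with the Python standard library; the endpoint path and the EBU Core vocabulary terms used in the query are assumptions for illustration, not confirmed details of the platform.

```python
import urllib.parse
import urllib.request

# Assumed endpoint location; the actual platform may expose a different path.
ENDPOINT = "http://data.memad.eu/sparql"

# Illustrative query: list programmes and their titles. The class and property
# names are hypothetical EBU Core terms chosen for the example.
QUERY = """
PREFIX ebucore: <http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#>
SELECT ?programme ?title WHERE {
  ?programme a ebucore:TVProgramme ;
             ebucore:title ?title .
} LIMIT 10
"""

def build_request(query: str) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP POST asking for SPARQL JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT, data=data,
        headers={"Accept": "application/sparql-results+json"},
    )

req = build_request(QUERY)
# urllib.request.urlopen(req) would then return a SPARQL JSON result set.
```

Because the graph re-uses well-known vocabularies, the same query pattern works from any SPARQL client, which is the practical payoff of the Linked Data approach described above.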
MeMAD took a significant step beyond the state of the art in automatic content description by proposing semi-automatic video content description that can be applied to different contexts of use. Providing video description creates new audiences by creating a verbal surrogate that anyone, not only people with disabilities, can benefit from. The automatic analysis techniques detect visual and auditory elements in multimedia and label them with pre-defined concepts, generate textual descriptions of the content and provide speech recognition. Our work on multimodal machine translation resulted in a new state of the art in image caption translation. Furthermore, our approach to document-level translation has become the de facto standard for discourse-level machine translation, and we have released pre-trained models for subtitle translation for the project's focus languages.

We demonstrated the added value of developing a knowledge graph for integrating heterogeneous legacy metadata with automatic analysis results. We extended the existing EBU standard used in the media industry and proposed an extensible set of interchange formats that re-use well-known vocabularies. We developed multimodal methods for topical segmentation of multimedia content and for aligning an existing content description with segments, thus enabling access to audiovisual content at the fragment level. We developed new methods to predict the memorability of those fragments as a surrogate for assessing their importance, though a significant gap remains in generalizing them to any type and genre of audiovisual content. Finally, we proposed new methods and systems that perform named entity recognition and disambiguation on noisy transcripts or directly from speech. We developed innovative explainable methods for extracting topics from audiovisual segments and for categorizing and enriching those segments using external information and background knowledge.
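As a simplified illustration of the topical segmentation idea, the TextTiling-style sketch below operates on a unimodal transcript: adjacent windows of sentences are compared by cosine similarity over word counts, and low-similarity points are proposed as topic boundaries. This is a minimal stand-in under stated assumptions, not the project's multimodal method.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_boundaries(sentences, window=2, threshold=0.1):
    """Propose a boundary wherever adjacent sentence windows share few words."""
    bags = [Counter(s.lower().split()) for s in sentences]
    boundaries = []
    for i in range(window, len(bags) - window + 1):
        left = sum(bags[i - window:i], Counter())    # words before position i
        right = sum(bags[i:i + window], Counter())   # words after position i
        if cosine(left, right) < threshold:
            boundaries.append(i)                     # boundary before sentence i
    return boundaries
```

On a transcript where two sport sentences are followed by two weather sentences, the word overlap between the windows drops to zero at the transition, so a boundary is proposed there. The multimodal methods in the project additionally exploit visual and acoustic cues, but the underlying intuition of detecting local coherence minima is the same.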
Figure: The proposed interactive human-machine content annotation process.