
Video Content Description System

Final Report Summary - VISION (Video Content Description System)


The VISION system analyses broadcast content in real time, automatically grouping it into pre-defined categories for instant review.

In a football match, for example, all the goals, close misses, saves and fouls can be grouped as they happen for the viewer to review at will - all while the game continues live. Viewers could also use the system to watch a summary of the match so far if they tune in late. The underlying technology platform could be used for other exciting applications such as movies or soap operas.

Television viewers may soon have a more efficient, intelligent way of browsing through the content on their television set-top boxes, thanks to VISION, an FP7 project funded through a Marie Curie Intra-European Fellowship (IEF).

Media boxes that can store large amounts of audiovisual content are fast becoming commonplace in many households. The VISION project identified a gap in the market for ‘intelligent’ set-top boxes that are able to structure and index live broadcasts into a set of pre-defined events. For sports, these could be the highlights of a football match, such as goals, close misses and controversial incidents.
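To make the idea concrete, the sketch below shows what such an event index might look like in code. The structure and category names are illustrative assumptions, not details taken from the project.

```cpp
#include <string>
#include <vector>

// Hypothetical record for one detected highlight in a broadcast.
struct IndexedEvent {
    double      startSec;   // offset into the recording, in seconds
    double      endSec;
    std::string category;   // e.g. "goal", "close miss", "foul"
};

// The index a smart set-top box might maintain alongside the stored
// video: events are appended as they are detected, so a viewer can
// jump straight to any category while the broadcast continues.
using EventIndex = std::vector<IndexedEvent>;
```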

The VISION project, which began in June 2011 and ended on 30 May 2013, aimed to design and verify a hardware system that could extract state-of-the-art video features and use them as a basis for structuring and indexing.

The project was conceived by Dr Rafal Kapela, a silicon / CAD development engineer at Poznan University of Technology and Professor Noel E. O'Connor, a Principal Investigator in CLARITY: Centre for Sensor Web Technologies (www.clarity-centre.org) at Dublin City University, Ireland. The collaboration brought together Professor O’Connor’s previous experience developing algorithms for information extraction and Dr Kapela’s background in creating hardware and embedded systems.

These days, anybody can become a content producer and anyone can make their content available online. Of course, this has led to the well-known information access problem, and the key challenge now is finding the right piece of information at the right time. The group at DCU has a long track record of analysing audiovisual content with a view to extracting useful, actionable knowledge - in other words, taking raw audio, video and images and understanding what they mean in a real-world scenario. The group has done this across multiple domains; one in particular was sports broadcasts, where there is a strong market for accessing content in intuitive, flexible and easy ways.

A great deal of research has been conducted in the area of content structuring, particularly for sports broadcasts. Most of that work has focused on football, thanks to its worldwide popularity, but most of the resulting techniques do not translate to other sports or domains.

The algorithms developed within VISION addressed this challenge by being sufficiently generic to analyse any field sport. The techniques we developed work for any sport that falls under the definition of ‘a match held on an indoor or outdoor field of play and involving two teams’, which covers rugby, football, cricket, basketball, baseball and many others.
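One way to picture this genericity is a single detector interface shared across sports, with only the event vocabulary and the logic behind it changing. The sketch below is an illustration of that idea under assumed feature inputs (audio energy, global motion), not the project's actual design.

```cpp
#include <optional>
#include <string>

// Assumed low-level inputs a generic analyser might consume per frame.
struct FrameFeatures {
    double timestampSec;
    double audioEnergy;     // crowd noise is a strong cue in field sports
    double motionMagnitude; // e.g. a fast camera pan following play
};

// One detector interface serves rugby, football, cricket and so on;
// only the event vocabulary and the logic behind update() differ.
class EventDetector {
public:
    virtual ~EventDetector() = default;
    // Returns a category label when the recent feature history looks
    // like an event worth indexing, otherwise std::nullopt.
    virtual std::optional<std::string> update(const FrameFeatures& f) = 0;
};
```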

The initial DCU algorithms were able to extract useful information, but they did not run in a realistic timeframe: they were computationally intensive, requiring heavy-duty computers to perform the sophisticated processing needed to analyse the audio and visual signals. The team used a MATLAB environment to model and evaluate the prototype algorithms and then looked at other technology that would enable real-time processing. The software developed in VISION relies on freely available libraries that can be downloaded from the internet without restriction, along with hardware accelerators for operations such as video decoding, video encoding and image processing.
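The report does not name the libraries used, so the sketch below is purely illustrative: it uses OpenCV, one such freely downloadable library, to compute a classic field-sport feature - the fraction of grass-coloured pixels in a frame, which helps distinguish wide field shots from close-ups and crowd shots. The file name and hue thresholds are assumptions.

```cpp
#include <iostream>
#include <opencv2/opencv.hpp>

// Fraction of "grass-coloured" pixels in a frame; the hue bounds
// below are illustrative values for typical pitch greens.
double grassRatio(const cv::Mat& frameBgr) {
    cv::Mat hsv, mask;
    cv::cvtColor(frameBgr, hsv, cv::COLOR_BGR2HSV);
    cv::inRange(hsv, cv::Scalar(35, 40, 40), cv::Scalar(85, 255, 255), mask);
    return cv::countNonZero(mask) / static_cast<double>(mask.total());
}

int main() {
    cv::VideoCapture cap("match.ts");   // hypothetical recorded broadcast
    cv::Mat frame;
    while (cap.read(frame)) {
        // A real system would combine this with audio and motion
        // features and feed them into the event-structuring stage.
        std::cout << cap.get(cv::CAP_PROP_POS_MSEC) / 1000.0
                  << "s grass=" << grassRatio(frame) << "\n";
    }
    return 0;
}
```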

The final stage of the project was to deploy these algorithms to the hardware platform and make use of the hardware accelerators. By the end of the project, we had created a hardware-friendly system that can be embedded into set-top devices. This will lead to smarter set-top boxes that not only store and play content, but also analyse it and let users access it in a non-linear fashion. For example, a viewer who tunes in late to a live match could use their smart set-top box to generate an automatic summary of the match so far.
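Building that catch-up summary could be as simple as filtering the event index up to the viewer's current live position. The sketch below reuses the illustrative IndexedEvent record from earlier and is, again, only a sketch, not the project's implementation.

```cpp
#include <string>
#include <vector>

// Same illustrative event record as in the earlier sketch.
struct IndexedEvent {
    double      startSec;
    double      endSec;
    std::string category;
};

// "Summary so far": every highlight that finished before the viewer's
// current live position becomes part of a catch-up playlist.
std::vector<IndexedEvent> summarySoFar(const std::vector<IndexedEvent>& index,
                                       double nowSec) {
    std::vector<IndexedEvent> playlist;
    for (const auto& ev : index)
        if (ev.endSec <= nowSec)
            playlist.push_back(ev);
    return playlist;
}
```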

The VISION project's hardware platform has numerous other potential applications. At the moment, the team's research is focused on sports analysis, but the techniques and technology could be applied in other audiovisual domains as well, such as movies and soap operas.