The concept behind an audio-visual search engine is, on the face of it, rather simple. It addresses a fundamental weakness of computers up to now: while they are expert at finding words in text, finding objects in images and video is another matter.

To understand why, think about how much interpretation is already encoded in writing. When we speak on the phone, we create a physical signal. By the time that information is written down, the signal has been encoded into a series of digital symbols (the letters) placed one after the other. Computers are very efficient at manipulating these symbols because they do not really have to interpret them; they only have to find patterns among them.

This is not true for video. Imagine, for example, 10 distinct video snippets of cats. A textual description of their content would be very easy to search because (in English) we would use the word 'cat' to describe each of them. But in each snippet the set of pixels that depicts the cat will differ greatly in shape, size and colour. It is very difficult for a computer to recognise that these very different sets of pixels all depict the same kind of object, a cat.

To address this problem, the EU-funded project 'Interactive semantic video search with a large thesaurus of machine-learned audio-visual concepts' (Vidivideo) has developed a search engine that does what its title describes. Vidivideo is a research project and as such has neither the goal nor the resources to solve this problem in its entirety. Rather, it sought to provide the building blocks that enable computers to identify, with speed, consistency and accuracy, what an object in a video is.

'We have been working on video analysis for a long time,' says Marcel Worring, associate professor at the University of Amsterdam and one of the coordinators of the Vidivideo project.
'But we found that there were things missing. There are three levels to video analysis: breaking up the video into shots, trying to describe what is in the video, and finally machine-learning. We felt that the shot segmentation could be done better, and wanted to work with the top experts in the world on machine-learning. We also wanted to add another element that was missing: speech and audio.' This was the impetus behind the Vidivideo project.

There is certainly a lot of video out there. Every minute, for example, more than 24 hours of video are uploaded to YouTube. To keep up and make sense of all this content, we need systems that work very fast. 'A major challenge is speed and scalability,' says Prof. Worring. 'The tools we have now are far more accurate, but it still takes computation time. We have to train our systems with example videos for which expert users have labelled the content, and this is a time-consuming task.'

Part of the solution is to let the system perform its task in parallel on many computers. But the Vidivideo team also realised that a modular architecture would be very important: you start with a little bit of intelligence, and add more as it becomes available.

But how does Vidivideo, which received funding under the EU's Sixth Framework Programme for ICT research, work? Imagine a group of people watching a video of a complicated procedure, such as assembling a Japanese printer. The first two people recognise that the scene contains a printer. The third person comes in and recognises where the cartridge is, while the fourth person (who can read Japanese) recognises the make of the cartridge, and so on. At every point, there is something more to say about the printer, something that makes the picture more precise. Vidivideo functions in exactly the same way. Up to 1000 specialist modules have been developed, which look at a video at the same time.
When one of them recognises what it has been trained to recognise, it flags this up. On their own, these modules are not generally intelligent, but working together they provide a more and more complete picture.

Another advantage of Vidivideo is that its architecture is highly flexible, allowing scientists and researchers to add modules at will to the collective intelligence of the system. At the start of the project in 2007 there were about a hundred modules; by its completion at the beginning of 2010 there were over 1000. Vidivideo also contains audio modules that have been trained to recognise a large number of different sounds, from birds and gunshots to rain and thunder.

The search engine has been validated with end-users in the fields of broadcasting, surveillance and cultural heritage. It has also proven its quality in the three major international benchmarks in the field, namely Trecvid, Pascal VOC and Imageclef. In all three benchmarks the Vidivideo search engine received the top rank in automatic image/video annotation, while at Trecvid it also ranked first in interactive search.

Some of the partners involved in the project have gone on to work on the 'safer internet' project I-Dash to help in the fight against child pornography. This is serious organised crime: thousands of videos are often produced by the same source. Vidivideo technology helps to establish connections across videos. For example, the same visual detail (a plant, a piece of furniture) may appear in more than one video. The tool therefore allows officers to group together videos they think were filmed in the same room, potentially helping them to identify the location of the criminals.

Surveillance is another area of huge potential. Until now, the emphasis has been on detecting physical objects in video, but Vidivideo can also be used to recognise forms of behaviour. For instance, someone walks onto a stage with a suitcase, and walks off without one.
This change can be picked up. Such possibilities could be of interest for police applications to counter terrorism. When you consider that there are more than 4 million CCTV cameras in the UK, it is clear that technology providing at least a first level of interpretation would be useful. In many city centres there is the threat of violence, especially late at night. Vidivideo could be trained to identify certain precursors to violence, such as raised voices or aggressive movements, before trouble begins.

Another, perhaps more mundane but equally significant, opportunity opened up by this technology is effective audio-visual archiving. Documentary makers looking for specific examples of video would be able to zero in more quickly on exactly what they are looking for, and the same goes for public platforms such as YouTube. What if your search query for 'cat' were based not on how videos are labelled but on the actual visual content itself? Experiments with social websites have already shown that this technology has enormous potential.

Vidivideo promises a future that not only capitalises on our digital audio-visual world, but is also one in which the barriers and limitations of language are significantly reduced.