Visual AI: From Babies to the Movies

To round off May, Week 5’s fantastic Tuesday Talk was delivered by Professor Andrew Zisserman from the University’s Department of Engineering Science to a near-packed hall. Unusually, the talk was split into two parts: the first focused on how his research group and collaborators have developed models that can learn to process audio-visual inputs; the second discussed a practical application of those models. And, as ever, the evening was made complete by the tireless efforts of the Reuben catering staff!

Learning like babies

Professor Zisserman began by asking a simple question: can a machine learn to tell the source and location of a sound simply through exposure to lots of stimuli, letting it form its own associations (much as a baby learns)? In this case, the stimuli are thousands of video clips containing both audio and visual elements.

He showed us his group’s models, which could pick out lips moving in sync with a voice and so pinpoint where a sound was coming from, even amid a crowd of other voices. This wasn’t limited to speech; musical instruments could also be distinguished. Stunningly, these models could then be improved further to learn the meanings of words by analysing the different features of each clip, identifying elephants, trees, and snow, among other things.

In the short interval, we were treated to roasted butternut squash with a pearl barley salad and black peas. This was a meal we (at my table at least!) made short work of, before focusing back on the talk.

Could this help the visually impaired?

Having showcased models that could identify the key features of video clips, Professor Zisserman turned to their application: how they might be used to generate high-quality audio descriptions of films for visually impaired audiences.

This brings a significant step up in difficulty: a model must identify the key features of any scene, relate them to the film’s wider context, and describe them succinctly enough to fit into the gaps between dialogue.

The first requirement for a useful description is identifying which characters are present. For this, the model matches the faces it can see against the cast information on the film’s IMDb page. By naming the characters and summarising their actions and interactions, the model could describe individual clips succinctly. However, it struggled to make sense of scenes built from multiple camera shots. By asking the model to consider the wider context of several shots, and how they thread together into a three-dimensional environment, Professor Zisserman was able to coax out descriptions that better reflected the story being told.
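For readers curious what that face-matching step might look like in practice, here is a minimal, illustrative sketch in Python (my own, not Professor Zisserman’s actual system). It assumes face embeddings have already been computed for the faces detected in a shot and for each cast photo on the film’s IMDb page, and it simply names each detected face after the closest cast member by cosine similarity; every name and number below is made up.

# Illustrative sketch only: match detected faces to a film's cast by
# comparing precomputed face embeddings (hypothetical inputs).
import numpy as np

def name_faces(face_embeddings, cast_embeddings, cast_names, threshold=0.6):
    """Assign each detected face the name of the closest cast member.

    face_embeddings: (num_faces, dim) array, one embedding per detected face
    cast_embeddings: (num_cast, dim) array, one embedding per cast photo
    cast_names:      list of num_cast character names
    threshold:       minimum cosine similarity to accept a match
    """
    # Normalise so that a dot product equals cosine similarity.
    faces = face_embeddings / np.linalg.norm(face_embeddings, axis=1, keepdims=True)
    cast = cast_embeddings / np.linalg.norm(cast_embeddings, axis=1, keepdims=True)

    similarities = faces @ cast.T          # (num_faces, num_cast)
    best = similarities.argmax(axis=1)     # closest cast member per face

    names = []
    for i, j in enumerate(best):
        if similarities[i, j] >= threshold:
            names.append(cast_names[j])
        else:
            names.append("unknown")        # no cast member is a good enough match
    return names

# Toy example with made-up 4-dimensional embeddings.
faces_in_shot = np.array([[0.9, 0.1, 0.0, 0.1],
                          [0.0, 0.8, 0.2, 0.1]])
cast_photos = np.array([[1.0, 0.0, 0.0, 0.0],   # e.g. "Dr. Grant"
                        [0.0, 1.0, 0.0, 0.0]])  # e.g. "Dr. Sattler"
print(name_faces(faces_in_shot, cast_photos, ["Dr. Grant", "Dr. Sattler"]))

The real challenge, as the talk made clear, comes after this step: weaving the named characters into descriptions that span several shots and fit the gaps between dialogue.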

Finally, to convey a film’s atmosphere, his model also took into account the type of camera shot being used. For example, the model could explain why a particular scene used a close-up and what that signalled the viewer should focus on.

Bringing it all together with dinosaurs

There was one final surprise in store: dinosaurs! The talk ended with clips of one of the greatest dinosaur movies ever made, complete with computer-generated audio descriptions. It brought a whole new way to experience The Land Before Time. And to prove that it wasn’t all about Littlefoot, Ducky, Petrie, and friends, Professor Zisserman also applied the model to Jurassic Park. With that, it was time to have some pineapple cake and reflect on another great Tuesday Talk.