By Meredith Alexander Kunz, Adobe Research
We all have long, mostly dull home videos on our memory cards that contain just a few great moments. What if we could train computers to find the specific scenes we’re looking for?
Today, Adobe Research scientists are partnering with the University of California, Berkeley, to turn computers to this task. They have created a new system that lets a user pore through personal videos for specific moments using natural language search. Their work forms the basis of a paper presented at the 2017 International Conference on Computer Vision (ICCV).
Lisa Anne Hendricks, a PhD student at the University of California, Berkeley, a former Adobe Research intern, and a 2017 Adobe fellowship winner, worked on this project with Adobe Research’s Bryan Russell and Oliver Wang, research scientists; Eli Shechtman, principal scientist; and Josef Sivic, consulting senior research scientist. Hendricks’ advisor, Trevor Darrell, also contributed to the work.
Find Your Moment
As Russell explains, the basic problem that the research team defined is this: “Given an unedited personal video, can you retrieve a specific moment with a natural language query?”
The research team wanted users to be able to employ the regular words that most people speak and write, also known as natural language, to look for specific scenes in videos. Ideally, a system would allow a user to search for a moment with a complex phrase, such as “show me when the girl turns a cartwheel after falling down.”
To get good answers to these queries, the team needed to collect data and create an algorithm that can handle complex natural language queries, including those with “temporal language”: words like “after,” “before,” and “while” that describe when a particular moment happens. With this in mind, they would leverage the latest advances in natural language processing and computer vision to help computers better understand videos.
To Train Computers, Feed Them Good Data
The researchers focused on training their network using deep learning, a method in which complex layers of neural networks are trained to understand data on their own. They had a problem, however. Machines can only learn when they have good, relevant data to review and digest. Existing video datasets were sorely lacking when it came to unedited home movies that contained just a few action-filled moments. If data is “the food,” as Russell calls it, the researchers would need to cook their own.
“We wanted to work with unedited personal videos, because people will want to use this kind of technology with their mobile phone video. Edited videos have selected chunks, but unedited videos have lots of boring moments to sift through,” Russell says. The researchers turned to a wide range of personal videos available publicly on an online sharing platform under Creative Commons licenses.
Critical to this work would be labels that align moments in videos with the natural language phrases describing them, in a form computers could use to answer future queries. Using crowdsourcing, the research team hired people to view and label moments in the videos, pairing each moment with a natural language description.
The researchers made the project easier for their human workers by breaking longer videos into a series of GIFs that could be quickly labeled. They then verified the data, since some people may have different definitions of exactly when something happens in a video.
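The article doesn’t describe this preprocessing step in code, but a minimal sketch of cutting a video into short, fixed-length clips for annotators might look like the following. The five-second clip length mirrors the segments used in the DiDeMo paper; the ffmpeg settings, GIF output, and file naming are assumptions made for illustration.

```python
import math
import subprocess

def split_into_clips(video_path, duration_sec, clip_len=5.0):
    """Cut a video into consecutive fixed-length clips for quick annotation.

    Relies on the ffmpeg command-line tool being installed; the GIF output
    and encoding settings are illustrative, not the paper's exact pipeline.
    """
    clips = []
    for i in range(math.ceil(duration_sec / clip_len)):
        start = i * clip_len
        out = f"clip_{i:02d}.gif"
        subprocess.run([
            "ffmpeg", "-y",
            "-ss", str(start),            # seek to the clip's start time
            "-t", str(clip_len),          # keep clip_len seconds
            "-i", video_path,
            "-vf", "fps=8,scale=320:-1",  # small, fast-loading preview GIFs
            out,
        ], check=True)
        clips.append((start, min(start + clip_len, duration_sec), out))
    return clips
```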
New Dataset: DiDeMo
That meticulous work resulted in a pioneering new dataset, Distinct Describable Moments (DiDeMo). It is made up of over 10,000 personal videos—complete with “moments” and 40,000 natural language descriptions of their contents.
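To make the pairing concrete, each entry in such a dataset links a video, a free-form description, and the span of segments the description refers to. The record below is purely illustrative; the field names and file naming are not the released DiDeMo schema.

```python
# Illustrative annotation record (not the exact released schema): a video,
# a sentence, and the span of five-second segments the sentence describes.
annotation = {
    "video": "example_home_video.mp4",
    "description": "the dog reaches the top of the stairs",
    "times": [(4, 5)],  # moment covers segments 4 through 5
}
```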
DiDeMo became the ideal training data to feed into neural networks to test whether a computer could identify a moment. Using deep learning, the computers were taught to match moments to their natural language descriptions—with very encouraging results, Russell says.
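The article does not spell out the training objective, but models of this kind are typically trained with a ranking loss that pulls a sentence embedding toward the embedding of its matching moment and pushes it away from non-matching ones. Here is a minimal PyTorch sketch, assuming the embeddings are already computed and using an illustrative margin value.

```python
import torch
import torch.nn.functional as F

def moment_ranking_loss(query_emb, pos_moment_emb, neg_moment_emb, margin=0.1):
    """Margin ranking loss in a shared embedding space: the described moment
    should sit closer to the sentence than a non-matching moment does.
    All inputs are (batch, dim) tensors; margin=0.1 is an illustrative choice.
    """
    d_pos = (query_emb - pos_moment_emb).pow(2).sum(dim=-1)  # distance to true moment
    d_neg = (query_emb - neg_moment_emb).pow(2).sum(dim=-1)  # distance to a negative
    return F.relu(d_pos - d_neg + margin).mean()
```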
How does the computer “think through” this search? The new algorithm selects candidate moments and then looks at those, as well as the whole video. In a series of neural networks, the machine compares these moments to the words used to describe the action that the user is searching for. After scoring a set of moments, the computer matches the best one to a given phrase, such as “the dog reaches the top of the stairs” or “the car passes closest to the camera.”
The network outputs the start and end points of a scene, visualized as a green frame identifying the specific moment in the video. The team calls this the Moment Context Network (MCN).
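The article stays high level about how candidates are scored, but the idea can be sketched as follows: enumerate every contiguous run of segments, combine that run’s local features with whole-video context, project the result into the space shared with the sentence embedding, and keep the closest candidate. This is a simplified, assumed version with precomputed features; the paper’s model also incorporates additional inputs such as temporal endpoint features and is trained end to end.

```python
import torch

def retrieve_moment(segment_feats, query_emb, moment_encoder, clip_len=5.0):
    """Score every contiguous run of segments against a sentence embedding
    and return the best-scoring moment's (start, end) in seconds.

    segment_feats:  (n_segments, d) per-segment visual features
    query_emb:      (k,) sentence embedding in the shared space
    moment_encoder: maps concatenated local+global features into that space
    """
    n, _ = segment_feats.shape
    global_feat = segment_feats.mean(dim=0)              # whole-video context
    best_dist, best_span = float("inf"), None
    for i in range(n):
        for j in range(i, n):                            # every contiguous span
            local = segment_feats[i:j + 1].mean(dim=0)   # candidate moment features
            feat = moment_encoder(torch.cat([local, global_feat]))
            dist = (feat - query_emb).pow(2).sum().item()  # smaller = better match
            if dist < best_dist:
                best_dist = dist
                best_span = (i * clip_len, (j + 1) * clip_len)
    return best_span

# Toy usage: random features, with a linear layer standing in for the learned projection.
d, k, n_seg = 8, 8, 6
encoder = torch.nn.Linear(2 * d, k)
print(retrieve_moment(torch.randn(n_seg, d), torch.randn(k), encoder))
```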
Advancing Video Research
The paper breaks new ground on two levels, says Russell.
“First, we introduced a new dataset for video,” he explains, one that could help advance research more generally. “Second, we created a model that, given a video and the natural language input, retrieves a moment in a video.” The group’s demo shows the power of this approach as green boxes encase very specific actions during a longer video.
This work is still in its early stages, Russell explains: “This algorithm is a proof of concept.” Progress on this “hard problem” could eventually allow people to more easily search their large stores of personal videos for specific moments they want to watch, share, or use in a video-editing project.