AI Can Generate Realistic Sound for Video Clips

April 30, 2018

By Meredith Alexander Kunz, Adobe Research

A good live-action video captures moving images and clear audio. Some videos, however, lack an original sound recording, and some have poor quality audio. Typically, creating a realistic soundtrack for a video can be tough and time-consuming.

That’s why Adobe Research scientists and collaborators decided to investigate whether deep neural networks could learn to generate realistic sound for videos from scratch—audio that could meet human standards of naturalism.

Their conclusion? Yes—and in more than 70 percent of cases, the system’s generated soundtracks “fooled” humans into thinking they were real.

Check out this video to see if you can tell which sounds are real and which are generated by machine:

Adobe Research intern Yipin Zhou, from the University of North Carolina at Chapel Hill, along with Adobe Research scientists Zhaowen Wang, Chen Fang, Trung Bui, and Zhou’s advisor, Tamara L. Berg, designed this machine-learning model to generate realistic soundtracks for short video clips.

To train the model well, the team created a large annotated dataset of video examples, drawing on more than two million publicly available short clips. These videos were divided into labeled categories such as dogs, chainsaws, and helicopters.

The research team then asked crowdsourced workers to find clips in which the sound source is visible on screen and prominent in the soundtrack. This filtering yielded a new dataset of more than 28,000 videos, each about seven seconds long, covering 10 categories.

The researchers trained their system to associate each category of video with its characteristic sound waveforms and to re-create those waveforms from scratch. They then tested the results by asking human evaluators to rate the quality of each video’s sound and to judge whether the audio was real or machine-generated.
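The core idea—generating a waveform one sample at a time, conditioned on visual features of the video frames—can be sketched in a toy form. The dimensions, random weights, and simple tanh recurrence below are all illustrative assumptions; the actual model uses learned CNN features and a much deeper sample-level recurrent generator trained on the dataset described above.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 16         # dimensionality of per-frame visual features (hypothetical)
HIDDEN = 32            # RNN hidden size (hypothetical)
SAMPLES_PER_FRAME = 4  # audio samples emitted per video frame (toy rate)

# Random vectors stand in for CNN activations of the video frames.
n_frames = 8
frames = rng.normal(size=(n_frames, FRAME_DIM))

# Untrained RNN parameters; in practice these are learned by backpropagation.
W_x = rng.normal(scale=0.1, size=(HIDDEN, 1))          # previous audio sample
W_v = rng.normal(scale=0.1, size=(HIDDEN, FRAME_DIM))  # visual conditioning
W_h = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))     # recurrence
w_out = rng.normal(scale=0.1, size=(HIDDEN,))          # hidden -> next sample

def generate(frames):
    """Autoregressively emit waveform samples, each step conditioned on
    the visual feature of the current frame and the previous sample."""
    h = np.zeros(HIDDEN)
    sample = 0.0
    audio = []
    for v in frames:
        for _ in range(SAMPLES_PER_FRAME):
            h = np.tanh(W_x[:, 0] * sample + W_v @ v + W_h @ h)
            sample = float(np.tanh(w_out @ h))  # next sample, in [-1, 1]
            audio.append(sample)
    return np.array(audio)

waveform = generate(frames)
print(waveform.shape)  # (n_frames * SAMPLES_PER_FRAME,) = (32,)
```

Because each output sample feeds back into the next step, the generated audio can stay temporally aligned with the changing visual input—the property the human evaluators were asked to judge.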

“Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs,” Zhou told MIT Technology Review. “Evaluations show that over 70 percent of the generated sound from our models can fool humans into thinking that they are real.”

The work will be presented at CVPR 2018 in June.


Zhaowen Wang, Chen Fang, Trung Bui (Adobe Research)

Yipin Zhou, Tamara L. Berg (University of North Carolina at Chapel Hill)
