Research by: Mina Huh (UC Berkeley), C. Ailie Fraser (Adobe Research), Dingzeyu Li (Adobe Research), Mira Dontcheva (Adobe Research), Bryan Wang (Adobe Research) · Presented at ACM CHI 2026
This research was conducted during Mina Huh’s internship at Adobe Research. Note: VidTune is experimental academic research. It is not a current or announced Adobe product feature or capability.

Key takeaways
- Video creators struggle to find soundtracks that match their video’s mood — not because the right music doesn’t exist, but because reviewing and comparing audio options is slow and tedious.
- The experimental technology VidTune generates multiple music tracks from a text prompt and produces visual “contextual thumbnails” for each one, letting creators compare options at a glance rather than listening sequentially.
- Thumbnails encode musical attributes — mood, energy, genre, and instrumentation — as visual cues grounded in the video’s own subjects and imagery.
- Creators can refine results with natural language edits (“make it more energetic,” “change the genre”), which VidTune expands into new generations.
- User studies found the tool effective for browsing and comparing options, with participants describing the experience as playful and enriching.
- VidTune is the latest prototype in a body of HCI research on how creators can more easily add generative music to their videos. Last year the team published a paper at ACM DIS 2025 that explored vibe-based prompt recommendations and structured refinement of music output.
Picking the right soundtrack for a video is one of those creative decisions that looks simple from the outside but turns out to be surprisingly hard. The music needs to match the mood, the pacing, the story — and when you’re working with text-to-music generation tools, you quickly run into a practical problem: you have to listen to everything. Track by track, hoping something clicks.
VidTune, a new experimental system from researchers at Adobe Research and UC Berkeley, tackles this challenge by bringing visual thinking to an auditory problem. It’s being presented at CHI 2026, a premier conference in human-computer interaction.
The problem with listening
Current text-to-music tools ask creators to describe what they want in a prompt, then surface results as waveforms or audio players. For non-music experts — which describes most video creators — this creates a bottleneck. A formative study with eight creators found three recurring friction points: difficulty constructing diverse prompts, limited ability to quickly review and compare tracks, and uncertainty about how a track would actually feel against their footage.
VidTune is designed to address all three.

What VidTune does
The system works in three stages. First, it helps creators build and expand their prompts, suggesting relevant terms based on their video content to broaden the range of generated options. Second, it generates multiple music tracks in parallel and produces a contextual thumbnail for each one — a short animated visual that maps the track’s musical attributes onto imagery drawn from the creator’s own video.
Those thumbnails are the central innovation. Rather than presenting music as an abstract waveform, VidTune visualizes each track’s valence and energy through visual cues like color and brightness, and depicts genre and prominent instruments through the style and composition of the image. The subjects in each thumbnail are drawn from the actual video content, giving creators a meaningful, scannable preview they can assess at a glance.
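To make that mapping concrete, here is a minimal sketch of how a track’s attributes could be turned into a text-to-image prompt grounded in the video’s subject. Everything in it is an illustrative assumption — the function, the attribute names, and the cue phrases are not drawn from the paper, which does not publish its pipeline in this form.

```python
# Illustrative sketch only: the attribute names and cue phrases are
# assumptions, not VidTune's actual thumbnail pipeline.

def thumbnail_prompt(track: dict, video_subject: str) -> str:
    """Compose a text-to-image prompt whose visual cues reflect the track."""
    # Valence (how positive the track feels) maps to color warmth.
    warmth = "warm, golden tones" if track["valence"] > 0.5 else "cool, muted tones"
    # Energy maps to brightness and contrast.
    light = "bright, high-contrast lighting" if track["energy"] > 0.5 else "soft, dim lighting"
    # Genre and a prominent instrument shape the style and composition.
    style = f"in the style of {track['genre']} album art"
    composition = f"with a {track['instrument']} woven into the scene"
    # Ground the image in the creator's own footage.
    return f"{video_subject}, {warmth}, {light}, {style}, {composition}"

print(thumbnail_prompt(
    {"valence": 0.8, "energy": 0.9, "genre": "synthwave", "instrument": "electric guitar"},
    "a surfer at sunset",
))
```

In the real system these previews are animated and generated per track; the sketch only shows the core idea of translating musical attributes into scannable visual cues.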
In the third stage, creators can refine tracks through natural language edits. A phrase like “make it more energetic” or “change genre” triggers VidTune to expand the instruction into new generation parameters, producing a fresh set of options without requiring the creator to rebuild their prompt from scratch.
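As a rough sketch of that refinement loop, the snippet below expands an edit phrase into an adjusted parameter set while leaving the rest of the prompt intact. A hand-written rule table stands in for whatever model-driven expansion the system actually performs, and all names and values here are hypothetical.

```python
# Hypothetical sketch: a rule table stands in for VidTune's model-driven
# expansion of natural-language edits into new generation parameters.

EDIT_RULES = {
    "make it more energetic": {"energy": +0.3, "tempo_bpm": +20},
    "make it calmer": {"energy": -0.3, "tempo_bpm": -20},
}

def apply_edit(params: dict, instruction: str) -> dict:
    """Return a fresh parameter set with the edit applied; the original is untouched."""
    updated = dict(params)
    for key, delta in EDIT_RULES.get(instruction, {}).items():
        updated[key] = updated[key] + delta
    # Clamp energy to its valid range and round to keep printed values tidy.
    updated["energy"] = round(min(1.0, max(0.0, updated["energy"])), 2)
    return updated

base = {"energy": 0.5, "tempo_bpm": 100, "genre": "lo-fi"}
print(apply_edit(base, "make it more energetic"))
# energy rises to 0.8, tempo to 120 bpm; the genre is untouched
```

The design point the paper emphasizes is that the edit produces a fresh set of candidates rather than mutating a single track, so creators keep exploring instead of rebuilding prompts from scratch.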
What the research found
The researchers evaluated VidTune in two studies: a controlled user study with twelve participants and an exploratory case study with six. Both groups found the system helpful for reviewing and comparing music options efficiently. Participants described the contextual thumbnails as useful for understanding how tracks would feel against their footage — something waveforms and generic cover art fail to convey. The overall experience was described as playful and enriching.
Read the paper here: arxiv.org/abs/2601.12180 and visit the VidTune project page to learn more.
Beyond VidTune: Research on generative music for videos
VidTune builds on a broader line of HCI research exploring how to design generative music experiences for video creators. In prior work presented at ACM DIS 2025 (in collaboration with Carnegie Mellon University), the team investigated how “vibe-based” prompt suggestions and structured refinement of music output can help users better express their musical intent, explore possibilities, and iteratively evolve their output. That work helped shape how prompts are scaffolded and expanded in VidTune. Together, these projects point toward a consistent direction for HCI in generative music: designing interfaces that help people explore creative directions, articulate intent, and develop a sound that fits their unique story.
Video in figures created by Gabi Duncombe.