Research by: Mina Huh (UC Berkeley), C. Ailie Fraser (Adobe Research), Dingzeyu Li (Adobe Research), Mira Dontcheva (Adobe Research), Bryan Wang (Adobe Research) · Presented at ACM CHI 2026
This research was conducted during Mina Huh’s internship at Adobe Research. Note: VidTune is experimental academic research. It is not a current or announced Adobe product feature or capability.

Key takeaways
- Video creators struggle to find soundtracks that match their video’s mood — not because the right music doesn’t exist, but because reviewing and comparing audio options is slow and tedious.
- The experimental technology VidTune generates multiple music tracks from a text prompt and produces visual “contextual thumbnails” for each one, letting creators compare options at a glance rather than listening sequentially.
- Thumbnails encode musical attributes — mood, energy, genre, and instrumentation — as visual cues grounded in the video’s own subjects and imagery.
- Creators can refine results with natural language edits (“make it more energetic,” “change the genre”), which VidTune expands into new generations.
- User studies found the tool effective for browsing and comparing options, with participants describing the experience as playful and enriching.
- VidTune is the latest prototype in a body of HCI research on how creators can more easily add generative music to their videos. Last year the team published a paper at ACM DIS 2025 that explored vibe-based prompt recommendations and structured refinement of music output.
Picking the right soundtrack for a video is one of those creative decisions that looks simple from the outside but turns out to be surprisingly hard. The music needs to match the mood, the pacing, the story — and when you’re working with text-to-music generation tools, you quickly run into a practical problem: you have to listen to everything. Track by track, hoping something clicks.
VidTune, a new experimental system from researchers at Adobe Research and UC Berkeley, tackles this challenge by bringing visual thinking to an auditory problem. It’s being presented at CHI 2026, a premier conference in human-computer interaction.
The problem with listening
Current text-to-music tools ask creators to describe what they want in a prompt, then surface results as waveforms or audio players. For non-music experts — which describes most video creators — this creates a bottleneck. A formative study with eight creators found three recurring friction points: difficulty constructing diverse prompts, limited ability to quickly review and compare tracks, and uncertainty about how a track would actually feel against their footage.
VidTune is designed to address all three.

What VidTune does
The system works in three stages. First, it helps creators build and expand their prompts, suggesting relevant terms based on their video content to broaden the range of generated options. Second, it generates multiple music tracks in parallel and produces a contextual thumbnail for each one — a short animated visual that maps the track’s musical attributes onto imagery drawn from the creator’s own video.
Those thumbnails are the central innovation. Rather than presenting music as an abstract waveform, VidTune visualizes each track’s valence and energy through visual cues like color and brightness, and depicts genre and prominent instruments through the style and composition of the image. The subjects in each thumbnail are drawn from the actual video content, giving creators a meaningful, scannable preview they can assess at a glance.
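To make that mapping concrete, here is a minimal sketch of how a track’s attributes could be turned into a text-to-image prompt grounded in the video’s subject. Everything in it is an illustrative assumption — the function, the attribute names, and the cue phrases are not drawn from the paper, which does not publish its pipeline in this form.

```python
# Illustrative sketch only: the attribute names and cue phrases are
# assumptions, not VidTune's actual thumbnail pipeline.

def thumbnail_prompt(track: dict, video_subject: str) -> str:
    """Compose a text-to-image prompt whose visual cues reflect the track."""
    # Valence (how positive the track feels) maps to color warmth.
    warmth = "warm, golden tones" if track["valence"] > 0.5 else "cool, muted tones"
    # Energy maps to brightness and contrast.
    light = "bright, high-contrast lighting" if track["energy"] > 0.5 else "soft, dim lighting"
    # Genre and a prominent instrument shape the style and composition.
    style = f"in the style of {track['genre']} album art"
    composition = f"with a {track['instrument']} woven into the scene"
    # Ground the image in the creator's own footage.
    return f"{video_subject}, {warmth}, {light}, {style}, {composition}"

print(thumbnail_prompt(
    {"valence": 0.8, "energy": 0.9, "genre": "synthwave", "instrument": "electric guitar"},
    "a surfer at sunset",
))
```

In the real system these previews are animated and generated per track; the sketch only shows the core idea of translating musical attributes into scannable visual cues.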
In the third stage, creators can refine tracks through natural language edits. A phrase like “make it more energetic” or “change genre” triggers VidTune to expand the instruction into new generation parameters, producing a fresh set of options without requiring the creator to rebuild their prompt from scratch.
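As a rough sketch of that refinement loop, the snippet below expands an edit phrase into an adjusted parameter set while leaving the rest of the prompt intact. A hand-written rule table stands in for whatever model-driven expansion the system actually performs, and all names and values here are hypothetical.

```python
# Hypothetical sketch: a rule table stands in for VidTune's model-driven
# expansion of natural-language edits into new generation parameters.

EDIT_RULES = {
    "make it more energetic": {"energy": +0.3, "tempo_bpm": +20},
    "make it calmer": {"energy": -0.3, "tempo_bpm": -20},
}

def apply_edit(params: dict, instruction: str) -> dict:
    """Return a fresh parameter set with the edit applied; the original is untouched."""
    updated = dict(params)
    for key, delta in EDIT_RULES.get(instruction, {}).items():
        updated[key] = updated[key] + delta
    # Clamp energy to its valid range and round to keep printed values tidy.
    updated["energy"] = round(min(1.0, max(0.0, updated["energy"])), 2)
    return updated

base = {"energy": 0.5, "tempo_bpm": 100, "genre": "lo-fi"}
print(apply_edit(base, "make it more energetic"))
# energy rises to 0.8, tempo to 120 bpm; the genre is untouched
```

The design point the paper emphasizes is that the edit produces a fresh set of candidates rather than mutating a single track, so creators keep exploring instead of rebuilding prompts from scratch.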
What the research found
The researchers evaluated VidTune in two studies: a controlled user study with twelve participants and an exploratory case study with six. Both groups found the system helpful for reviewing and comparing music options efficiently. Participants described the contextual thumbnails as useful for understanding how tracks would feel against their footage — something waveforms and generic cover art fail to convey. The overall experience was described as playful and enriching.
Read the paper here: arxiv.org/abs/2601.12180 and visit the VidTune project page to learn more.
Beyond VidTune: Research on generative music for videos
VidTune builds on a broader line of HCI research exploring how to design generative music experiences for video creators. In prior work presented at ACM DIS 2025 (in collaboration with Carnegie Mellon University), the team investigated how “vibe-based” prompt suggestions and structured refinement of music output can help users better express their musical intent, explore possibilities, and iteratively evolve their output. That work helped shape how prompts are scaffolded and expanded in VidTune. Together, these projects point toward a consistent direction for HCI in generative music: designing interfaces that help people explore creative directions, articulate intent, and develop a sound that fits their unique story.
Video in figures created by Gabi Duncombe.