Publications

VidSTR: Automatic SpatioTemporal Retargeting of Speech-Driven Video Compositions

CHI 2025

Publication date: April 26, 2025

Joshua Kong Yang, Mackenzie Leake, Jeff Huang, Stephen DiVerdi

Video editors often record multiple versions of a performance with minor differences. When they add graphics atop one video, they may wish to transfer those assets to another recording, but differences in performance, wording, and timing can leave the assets misaligned with the video content. Fixing this is a time-consuming, manual task. We present a technique that automatically retargets speech-driven video compositions while preserving the temporal and spatial alignment of the original composition. It can transfer graphics between both similar and dissimilar performances, including those varying in speech and gesture. We use a large language model for transcript-based temporal alignment and integer programming for spatial alignment. Results from retargeting between 51 pairs of performances show a temporal alignment success rate of 90% compared to hand-generated ground-truth compositions. We demonstrate challenging scenarios, retargeting video compositions across different people, aspect ratios, and languages.
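The abstract names two components: LLM-based temporal alignment over transcripts, and integer programming for spatial layout. As a rough, self-contained sketch of the temporal half only, the Python below substitutes classical sequence matching (difflib) for the paper's large language model and assumes hypothetical word-level timestamps; the data format and function names are illustrative, not taken from the paper.

# Sketch of transcript-based temporal retargeting. Assumes word-level
# transcripts with timestamps exist for both takes. The paper uses an
# LLM for alignment; difflib's sequence matching is a simpler stand-in.
from difflib import SequenceMatcher

# Hypothetical input format: (word, start_sec, end_sec) per word.
src_words = [("welcome", 0.0, 0.4), ("to", 0.4, 0.5), ("the", 0.5, 0.6),
             ("demo", 0.6, 1.1)]
tgt_words = [("hi", 0.0, 0.3), ("welcome", 0.3, 0.8), ("to", 0.8, 0.9),
             ("our", 0.9, 1.0), ("demo", 1.0, 1.6)]

def align_words(src, tgt):
    """Map each source word index to a target word index where possible."""
    matcher = SequenceMatcher(a=[w for w, _, _ in src],
                              b=[w for w, _, _ in tgt])
    mapping = {}
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":  # words that match verbatim across takes
            for k in range(i2 - i1):
                mapping[i1 + k] = j1 + k
    return mapping

def retarget_span(start, end, src, tgt, mapping):
    """Shift a graphic's [start, end] span onto the target timeline using
    the nearest aligned word as an anchor; None if no anchor survives."""
    anchors = [(src[i][1], tgt[j][1]) for i, j in mapping.items()]
    if not anchors:
        return None
    def remap(t):
        # Snap to the aligned word whose source start time is closest to t,
        # then carry over the residual offset.
        s, g = min(anchors, key=lambda a: abs(a[0] - t))
        return g + (t - s)
    return remap(start), remap(end)

mapping = align_words(src_words, tgt_words)
print(retarget_span(0.0, 1.1, src_words, tgt_words, mapping))

Run as-is, this remaps a graphic spanning the full source phrase onto the target take's timeline, stretching around the target's inserted words. The paper's LLM-based alignment would additionally handle rewordings that verbatim matching misses.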


Research Area: Human-Computer Interaction