Towards 3D-Consistent Video Generators

Emergent capabilities of image generators have enabled many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test whether intermediate features of a video generator (OpenSora in our case) can support camera pose estimation. Surprisingly, we find only a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. We therefore propose to train jointly for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that state-of-the-art (SoTA) video generation and camera pose estimation networks share common structures, and we propose an architecture that unifies the two. The resulting unified model, named JOG3R, produces camera pose estimates of competitive quality while generating 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.
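As a rough illustration of what training jointly on photometric generation and 3D-aware errors could look like, the sketch below sums a per-pixel frame error and a camera pose error into one objective. It is not the JOG3R implementation: the function names, the mean-squared photometric term, the geodesic-angle-plus-translation pose term, and the `pose_weight` parameter are all assumptions made for this example.

```python
import numpy as np


def photometric_error(pred_frames, target_frames):
    """Mean squared error between generated and reference frames (assumed form)."""
    return float(np.mean((pred_frames - target_frames) ** 2))


def rotation_angle(R_pred, R_gt):
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def pose_error(R_pred, t_pred, R_gt, t_gt):
    """Rotation angle plus Euclidean translation distance (assumed 3D-aware term)."""
    return rotation_angle(R_pred, R_gt) + float(np.linalg.norm(t_pred - t_gt))


def joint_loss(pred_frames, target_frames, poses_pred, poses_gt, pose_weight=1.0):
    """Weighted sum of the photometric and pose terms.

    poses_pred and poses_gt are lists of (R, t) pairs, one per frame.
    pose_weight is a hypothetical balancing hyperparameter.
    """
    photo = photometric_error(pred_frames, target_frames)
    pose = float(np.mean([
        pose_error(Rp, tp, Rg, tg)
        for (Rp, tp), (Rg, tg) in zip(poses_pred, poses_gt)
    ]))
    return photo + pose_weight * pose
```

In this sketch the objective is zero only when the generated frames match the reference frames and every estimated camera pose matches its ground truth, which is the sense in which a single model is pushed toward both realistic and 3D-consistent output.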

The British Machine Vision Conference (BMVC 2025)

Publication date: November 24, 2025

Chun-Hao Paul Huang, Niloy J. Mitra, Hyeonho Jeong, Jae Shin Yoon, Duygu Ceylan

Research Areas: AI & Machine Learning; Computer Vision, Imaging & Video; Content Intelligence