Object-WIPER: A Smarter Way to Erase Objects from Video 

April 30, 2026

Tags: AI & Machine Learning, Computer Vision, Imaging & Video, Conferences

Research Authors: Saksham Singh Kushwaha (UT Dallas), Sayan Nag (Adobe Research), Yapeng Tian (UT Dallas), Kuldeep Kulkarni (Adobe Research)
Research accepted for publication at: CVPR 2026 — IEEE/CVF Conference on Computer Vision and Pattern Recognition
Project Page and Paper: Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
Note: This is academic research being presented at a peer-reviewed conference. It is not a product feature or capability. 

Key Takeaways 

  • Beyond the object: The experimental project Object-WIPER removes not just unwanted objects from video, but also their associated visual effects — shadows, reflections, and translucent traces — producing cleaner, more believable results than prior methods. 
  • No retraining required: The framework is training-free, leveraging a pre-trained text-to-video diffusion transformer to understand and erase both objects and their effects without any task-specific fine-tuning. 
  • A new benchmark: The team introduces WIPER-Bench, a curated real-world dataset for evaluating object removal with associated effects, along with a new metric designed to measure temporal consistency and scene coherence. 

Imagine you’re editing footage of a parkour athlete leaping across a rooftop, and you need to remove a bystander who wandered into the shot. You mask them out, but their shadow keeps sweeping across the concrete, frame after frame, a ghost that gives away the edit. Cleaning that up manually means hours of frame-by-frame work, the kind of task that has traditionally required either a professional VFX pipeline or a lot of patience. 

Object-WIPER is an experimental research project accepted to CVPR 2026. Rather than treating object removal as a simple mask-and-fill task, it recognizes that objects leave visual traces in a scene — shadows on the ground, reflections in glass, translucent overlaps with the background — and that a complete removal means erasing those traces too. 

The key insight is that a pre-trained text-to-video diffusion transformer (DiT), a model trained on large video datasets and already rich with knowledge of how light, materials, and objects interact, can be steered to identify and remove those effects without any additional training. A user provides two things: a rough mask around the object they want gone, and a short text phrase describing the object and its effects (for instance, “a dog and its shadow” or “a glass vase and its reflection”). Using the model’s visual-text cross-attention and visual self-attention, which tie the text description to specific regions of the video’s internal representation, the system automatically locates the visual traces associated with the object, even when those traces extend far beyond the original mask. These regions are then fused with the user’s mask into a final removal mask covering both object and effects.
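To make that concrete, here is a minimal, hypothetical sketch of how text-driven attention maps could be fused with a rough user mask. It is not the paper’s implementation: the tensor shapes, the averaging over text tokens, and the 0.5 threshold are all illustrative assumptions.

```python
import torch

def build_removal_mask(cross_attn, user_mask, threshold=0.5):
    """Fuse text-driven attention with a rough user mask.

    cross_attn: (T, H, W, N) attention from video patches to the N text
                tokens of a phrase like "a dog and its shadow".
    user_mask:  (T, H, W) rough binary mask drawn around the object.
    Returns a (T, H, W) binary mask covering the object and its traces.
    """
    # Regions attending strongly to "dog" or "shadow" light up even
    # outside the user's rough mask (e.g., the shadow on the ground).
    effect_map = cross_attn.mean(dim=-1)                        # (T, H, W)
    effect_map = effect_map / effect_map.amax(dim=(1, 2), keepdim=True)
    effect_mask = (effect_map > threshold).float()

    # Union with the user mask so the object itself is always covered.
    return torch.clamp(user_mask + effect_mask, max=1.0)

# Toy usage with random tensors standing in for real attention maps.
T, H, W, N = 8, 32, 32, 6
mask = build_removal_mask(torch.rand(T, H, W, N), torch.zeros(T, H, W))
print(mask.shape)  # torch.Size([8, 32, 32])
```

In practice the attention maps would come from the DiT’s intermediate layers rather than random tensors, but the fusion step, taking the union of the user’s mask and the text-highlighted regions, is the core idea.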

The method then inverts the video through the DiT to obtain structured noise that encodes the scene’s underlying layout. Latent tokens inside the removal mask are replaced with fresh Gaussian noise, effectively erasing the object and its traces, while background tokens are preserved. As the model reconstructs the video during denoising, those saved background tokens are reintroduced, keeping the surrounding scene temporally stable and visually coherent across frames.
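The sketch below illustrates only this masked-reset-and-anchor idea, under assumed latent shapes. The function names, the identity denoiser, and the per-step background snapshots are hypothetical stand-ins for the paper’s actual inversion and sampling procedure.

```python
import torch

def masked_latent_reset(inverted_latents, removal_mask, seed=0):
    """Replace masked latent tokens with fresh Gaussian noise.

    inverted_latents: (T, C, H, W) structured noise from inverting the video.
    removal_mask:     (T, 1, H, W) binary mask over object + effects.
    """
    gen = torch.Generator().manual_seed(seed)
    fresh = torch.randn(inverted_latents.shape, generator=gen)
    # Fresh noise where the object/effects were; inverted noise elsewhere.
    return removal_mask * fresh + (1 - removal_mask) * inverted_latents

def denoise_with_background_anchor(latents, background, removal_mask,
                                   denoise_step, timesteps):
    """Re-inject saved background latents outside the mask during denoising,
    keeping the untouched scene temporally stable across frames."""
    for t in timesteps:
        latents = denoise_step(latents, t)  # one reverse-diffusion update
        # Generated content inside the mask, original background outside.
        latents = removal_mask * latents + (1 - removal_mask) * background[t]
    return latents

# Toy usage: an identity "denoiser" and random stand-in latents.
T, C, H, W = 8, 4, 32, 32
mask = (torch.rand(T, 1, H, W) > 0.7).float()
inverted = torch.randn(T, C, H, W)
start = masked_latent_reset(inverted, mask)
background = {t: inverted for t in range(3)}  # per-step snapshots (illustrative)
out = denoise_with_background_anchor(start, background, mask,
                                     lambda x, t: x, range(3))
print(out.shape)  # torch.Size([8, 4, 32, 32])
```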

This approach lowers the barrier for video cleanup considerably. Content creators, journalists, and independent filmmakers can remove unwanted elements from footage without frame-by-frame compositing or specialized VFX software, instead using just a mask and a text phrase. 

To support rigorous evaluation, the team built WIPER-Bench, a real-world benchmark covering the four most common associated effect types: shadows, reflections, translucency, and mirror effects. They also introduce a new metric that jointly measures temporal consistency across frames, foreground-background coherence within frames, and how thoroughly the original content has been removed. Experiments on DAVIS and WIPER-Bench show Object-WIPER outperforming both training-based and training-free baselines, including Attentive-Eraser, ROSE, GenProp, and ProPainter. 
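The paper defines its own formulation; purely as an illustration, a composite metric of this kind might weight three crude proxies, one per property. Every sub-score below is a hypothetical stand-in, not the benchmark’s actual metric.

```python
import torch

def temporal_consistency(frames):
    # Crude proxy: smaller frame-to-frame change -> higher score.
    return 1.0 / (1.0 + (frames[1:] - frames[:-1]).abs().mean())

def fg_bg_coherence(frames, mask):
    # Crude proxy: the filled region's mean intensity should match its surround.
    inside = (frames * mask).sum() / mask.sum().clamp(min=1)
    outside = (frames * (1 - mask)).sum() / (1 - mask).sum().clamp(min=1)
    return (1.0 - (inside - outside).abs()).clamp(min=0)

def removal_thoroughness(edited, original, mask):
    # Crude proxy: how much the masked region differs from the original object.
    changed = ((edited - original).abs() * mask).sum() / mask.sum().clamp(min=1)
    return changed.clamp(max=1.0)

def composite_score(edited, original, mask, weights=(1/3, 1/3, 1/3)):
    parts = (temporal_consistency(edited),
             fg_bg_coherence(edited, mask),
             removal_thoroughness(edited, original, mask))
    return sum(w * p for w, p in zip(weights, parts))

# Toy usage on random frames in [0, 1]; the mask broadcasts over channels.
T, C, H, W = 8, 3, 64, 64
original, edited = torch.rand(T, C, H, W), torch.rand(T, C, H, W)
mask = (torch.rand(T, 1, H, W) > 0.8).float()
print(float(composite_score(edited, original, mask)))
```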

This work grew from a 2025 summer internship at Adobe Research, a testament to the kind of research that can emerge when a talented graduate student gets hands-on time with a world-class lab. As object-level video understanding continues to mature, experimental methods like Object-WIPER point toward a future where clean, professional-quality editing may one day be within reach for anyone with a camera. Code, models, and WIPER-Bench will be publicly released. 

Check out the paper and project page to learn more and watch the Object-WIPER video demo.

Wondering what else is happening inside Adobe Research? Check out our latest news here. 
