ImProvShow: Multimodal Fusion for Image Provenance Summarization

We present ImProvShow, a novel approach to summarizing the multi-stage edit history (or "provenance") of an image. ImProvShow fuses visual and textual cues to succinctly summarize a sequence of manipulations applied to an image, a novel extension of the classical image difference captioning (IDC) problem. ImProvShow takes as input several intermediate thumbnails from the image editing sequence, together with any coarse human or machine-generated annotations of the individual manipulations at each stage, if available. We demonstrate that the presence of intermediate images and/or auxiliary textual information improves the model's edit captioning performance. To train ImProvShow, we introduce METS (Multiple Edits and Textual Summaries), a new open dataset of image editing sequences, with machine-generated textual annotations of each editing step and human-written edit summarization captions after the 5th, 10th and 15th manipulation.
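As a rough illustration of the inputs described above, the following Python sketch shows one way an editing sequence with optional per-step annotations and summaries after the 5th, 10th and 15th manipulation might be represented. All names here (EditStep, EditSequence, model_inputs, thumbnail_path) are hypothetical and chosen for illustration; they are not the METS schema or the ImProvShow API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# From the abstract: summaries exist after the 5th, 10th and 15th manipulation.
SUMMARY_STEPS = (5, 10, 15)


@dataclass
class EditStep:
    """One manipulation in the sequence (hypothetical layout)."""
    thumbnail_path: str               # intermediate thumbnail after this edit
    annotation: Optional[str] = None  # coarse human/machine note, if available


@dataclass
class EditSequence:
    """An editing sequence with its summary captions (hypothetical layout)."""
    steps: List[EditStep]
    # Maps number of edits applied -> human-written summary caption.
    summaries: Dict[int, str] = field(default_factory=dict)


def model_inputs(
    seq: EditSequence, upto: int
) -> Tuple[List[str], List[str], Optional[str]]:
    """Collect thumbnails and available annotations for the first `upto` edits.

    Returns (thumbnails, annotations, target_summary). Steps without an
    annotation are skipped, since annotations are optional.
    """
    if upto not in SUMMARY_STEPS:
        raise ValueError(f"upto must be one of {SUMMARY_STEPS}")
    if upto > len(seq.steps):
        raise ValueError("sequence has fewer edits than requested")
    prefix = seq.steps[:upto]
    thumbnails = [s.thumbnail_path for s in prefix]
    annotations = [s.annotation for s in prefix if s.annotation is not None]
    return thumbnails, annotations, seq.summaries.get(upto)
```

Keeping the annotation optional mirrors the abstract's "if available" wording: a sequence with no annotations at all still yields a valid set of thumbnails for the visual branch.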


Publications

ImProvShow: Multimodal Fusion for Image Provenance Summarization

British Machine Vision Conference (BMVC)

Publication date: November 23, 2025

Alexander Black, Jing Shi, Yifei Fan, John Collomosse

Research Areas: AI & Machine Learning, Content Intelligence, Natural Language Processing