We present ImProvShow, a novel approach to summarizing the multi-stage edit history (or 'provenance') of an image. ImProvShow fuses visual and textual cues to succinctly summarize a sequence of manipulations applied to an image, a novel extension of the classical image difference captioning (IDC) problem. ImProvShow takes as input several intermediate thumbnails from the image editing sequence, together with any coarse human- or machine-generated annotations of the individual manipulations at each stage, if available. We demonstrate that the presence of intermediate images and/or auxiliary textual information improves the model's edit captioning performance. To train ImProvShow, we introduce METS (Multiple Edits and Textual Summaries), a new open dataset of image editing sequences, with machine-generated textual annotations of each editorial step and human-written edit summarization captions after the 5th, 10th and 15th manipulation.