Object inpainting is the task of adding objects to real images and compositing them seamlessly. With the recent commercialization of products like Stable Diffusion and Generative Fill, prompt-driven object insertion now achieves impressive visual results. In this paper, we propose a prompt suggestion model that simplifies prompt input. Given an image and a mask from the user, our model predicts suitable prompts from the partial contextual information in the masked image and from the shape and location of the mask. Specifically, we introduce a concept-diffusion in the CLIP space that predicts CLIP-text embeddings from a masked image. These diffused embeddings can be injected directly into open-source inpainting models such as Stable Diffusion and its variants. Alternatively, they can be decoded into natural language for use in other publicly available applications such as Generative Fill. Our prompt suggestion model balances accuracy and diversity, showing that it is both contextually aware and creatively adaptive.
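To make the decoding step concrete, the following is a minimal sketch of one possible way to turn a predicted text embedding into prompt suggestions: ranking candidate phrases by cosine similarity to the predicted vector. This is an illustrative assumption, not the decoder described in the paper. The function names `cosine` and `suggest_prompts`, the candidate phrases, and the 3-dimensional toy vectors are all hypothetical; real CLIP text embeddings have hundreds of dimensions and would come from a CLIP text encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def suggest_prompts(predicted, vocab, k=2):
    """Return the k candidate phrases whose embeddings are closest to `predicted`.

    `vocab` maps a phrase to its (hypothetical) text embedding.
    """
    ranked = sorted(
        vocab,
        key=lambda phrase: cosine(predicted, vocab[phrase]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-d vectors standing in for CLIP text embeddings (made up for illustration).
vocab = {
    "a dog": [0.9, 0.1, 0.0],
    "a cat": [0.8, 0.3, 0.1],
    "a boat": [0.0, 0.2, 0.9],
}

# Stand-in for an embedding predicted from a masked image.
predicted = [1.0, 0.15, 0.0]

suggestions = suggest_prompts(predicted, vocab, k=2)
print(suggestions)
```

Returning the top k candidates rather than a single best match reflects the stated goal of offering several plausible prompts, so that the user keeps a choice among diverse suggestions.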