Foreground image retrieval is a fundamental task in computer vision. Given an image of a background scene with a bounding box indicating the target location, the goal is to retrieve a set of foreground object images from a given category that are semantically compatible with the background. We formulate foreground retrieval as a self-supervised domain adaptation task, where the source domain consists of foreground images and the target domain of background images. Specifically, given pretrained object feature extraction networks that serve as teachers, we train a student network to infer compatible foreground features from background images. Foregrounds and backgrounds are thus effectively mapped into a common feature space, enabling retrieval of the foregrounds closest to the target background in that space. A notable feature of our approach is that, unlike current state-of-the-art methods, its training strategy does not require instance segmentation. Our method can therefore be applied to diverse foreground categories and background scene types, and it supports fine-grained foreground retrieval, which is closer to the requirements of real-world applications.
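
To make the formulation concrete, the sketch below illustrates the teacher-student training and nearest-neighbor retrieval described above. It is a minimal PyTorch sketch, not our implementation: the names (`student`, `teacher`, `train_step`, `retrieve`), the MSE training objective, and the assumption that each training background is paired with a ground-truth compatible foreground are all hypothetical choices for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: `teacher` is a frozen, pretrained foreground feature
# extractor; `student` learns to predict a compatible foreground feature from
# a background image (with the target bounding box encoded, e.g., as a mask).
def train_step(student, teacher, background, foreground, optimizer):
    with torch.no_grad():
        target = teacher(foreground)   # fixed foreground feature from the teacher
    pred = student(background)         # feature inferred from the background
    loss = F.mse_loss(pred, target)    # pull student output into the teacher's space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Retrieval: since both networks map into a common feature space, rank
# candidate foregrounds by cosine similarity to the feature the student
# infers from the query background, and return the top-k indices.
def retrieve(student, teacher, background, candidates, k=5):
    with torch.no_grad():
        q = F.normalize(student(background), dim=-1)   # (1, d) query feature
        db = F.normalize(teacher(candidates), dim=-1)  # (n, d) candidate features
        scores = db @ q.squeeze(0)                     # (n,) cosine similarities
    return torch.topk(scores, k).indices
```

Under this sketch, supervision comes only from pairing each background with a compatible foreground (e.g., the object originally present at the masked location), which is what makes the training self-supervised and removes the need for instance segmentation.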