Automatic preference models play an important role in the development of image generation and image editing models, yet existing approaches often lack either generalizability or proficiency. To address this limitation, we argue that incorporating judgments from image assessment specialists can equip a VLM-as-a-judge with richer, more reliable signals, ultimately enabling better modeling of human preferences while preserving broad generalizability. Given the additional specialist assessments, the VLM judge reasons through them, identifies the key deciding factors, and produces the final preference judgment. We experiment extensively on four public benchmarks covering both image generation and image editing tasks, in both pointwise and pairwise preference paradigms. Our specialist-aided in-context learning (ICL) models improve alignment with human preferences by 2% to 8% over their corresponding baseline VLMs. Beyond ICL, we also investigate how specialists can help generate image preference data for VLMs: we reverse-engineer chain-of-thought image preference data from the input, the ground-truth preference, and the specialists' assessments, then train a VLM on the synthetic data. Our fine-tuned model boosts alignment by up to 13% over its corresponding baseline VLM, achieving the best performance on certain benchmarks. Overall, our work demonstrates the potential of combining VLMs with image assessment specialists for reliable image preference modeling.
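The specialist-aided judging flow described above can be sketched minimally in code. Everything here is an illustrative assumption rather than the paper's actual implementation: the specialist names, score fields, and the aggregation rule are hypothetical, and in the described method the aggregation is performed by a VLM judge reasoning over the specialist assessments in context, not by a simple vote.

```python
# Sketch of specialist-aided pairwise preference judging.
# All names and the aggregation rule are illustrative assumptions;
# the described method uses a VLM judge reasoning in context instead.

from dataclasses import dataclass

@dataclass
class SpecialistAssessment:
    name: str          # e.g. an aesthetics or text-alignment scorer (hypothetical)
    score_a: float     # assessment of candidate image A
    score_b: float     # assessment of candidate image B
    rationale: str     # short textual justification

def build_judge_prompt(prompt: str, assessments: list[SpecialistAssessment]) -> str:
    """Format specialist assessments as in-context evidence for a VLM judge."""
    lines = [f"User prompt: {prompt}", "Specialist assessments:"]
    for a in assessments:
        lines.append(f"- {a.name}: A={a.score_a:.2f}, B={a.score_b:.2f} ({a.rationale})")
    lines.append("Reason over the assessments and output the preferred image (A or B).")
    return "\n".join(lines)

def naive_aggregate(assessments: list[SpecialistAssessment]) -> str:
    """Stand-in for the VLM's reasoning step: majority vote over specialist scores."""
    votes_a = sum(1 for a in assessments if a.score_a > a.score_b)
    votes_b = sum(1 for a in assessments if a.score_b > a.score_a)
    return "A" if votes_a >= votes_b else "B"

assessments = [
    SpecialistAssessment("aesthetics", 0.81, 0.64, "A has cleaner composition"),
    SpecialistAssessment("text-image alignment", 0.72, 0.90, "B matches the prompt better"),
    SpecialistAssessment("artifact detection", 0.95, 0.70, "B shows hand distortions"),
]

print(build_judge_prompt("a cat playing chess", assessments))
print("Preferred:", naive_aggregate(assessments))  # → Preferred: A
```

In the actual method, `build_judge_prompt`'s output would be passed to a VLM together with the candidate images, and the VLM's chain-of-thought reasoning would replace `naive_aggregate`.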