Designing a unified vision model that handles diverse visual transformation tasks without task-specific modifications remains a significant challenge, particularly when scaling and generalizing beyond narrowly defined objectives. We propose GENIE, a novel ControlNet-Diffusion framework that performs task-based image generation solely through visual exemplars, eliminating dependence on textual prompts or auxiliary metadata. Unlike conventional prompt-driven diffusion models, GENIE employs a dual visual conditioning mechanism, combining implicit guidance via ControlNet with explicit task encoding through CLIP-based visual arithmetic, to infer task intent directly from reference input–output pairs. To improve semantic alignment between visual exemplars and generated outputs, we introduce a lightweight task consistency loss, which encourages representational coherence in the embedding space across transformed pairs. While not a multitask learner in the classical sense, GENIE employs a task-agnostic architecture that enables task switching across multiple image-to-image transformations without any task-specific modifications to the model architecture or loss functions. Instead of being explicitly provided with task identifiers, the model infers the intended task implicitly from the reference input–output pair through visual conditioning. Evaluations across seven vision tasks (inpainting, colorization, edge detection, deblurring, denoising, segmentation, and depth estimation) and four out-of-distribution (OOD) tasks on OOD data (deraining, contrast enhancement, map-to-aerial, and scribble-to-image generation) demonstrate that GENIE achieves an average performance gain of 7.13% over other visual-prompt-conditioned baselines, showcasing its effectiveness for scalable and text-free visual generation.
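The abstract does not spell out how the visual arithmetic or the task consistency loss is computed. As a rough illustration only, one plausible reading is sketched below: a task is encoded as the difference between the embeddings of an output image and its input, and the loss penalizes disagreement between the task vector of the reference pair and that of the (query, generated) pair. The function names, the use of cosine distance, and the toy three-dimensional vectors standing in for CLIP image embeddings are all assumptions made for this sketch, not details taken from GENIE.

```python
import math


def task_vector(emb_in, emb_out):
    """Hypothetical 'visual arithmetic': the task is the embedding shift
    from the input image to the output image (assumed, not from the paper)."""
    return [o - i for i, o in zip(emb_in, emb_out)]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def task_consistency_loss(ref_in, ref_out, query_in, generated):
    """Assumed form of a task consistency loss: one minus the cosine
    similarity between the reference task vector and the task vector
    implied by the query image and the generated image."""
    t_ref = task_vector(ref_in, ref_out)
    t_gen = task_vector(query_in, generated)
    return 1.0 - cosine_similarity(t_ref, t_gen)


# Toy vectors standing in for image embeddings (illustrative values only).
ref_in, ref_out = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]
query_in = [0.0, 0.0, 1.0]
same_task = [0.0, 2.0, 1.0]       # shifts along the same direction as the reference
opposite_task = [0.0, -1.0, 1.0]  # shifts along the opposite direction

loss_same = task_consistency_loss(ref_in, ref_out, query_in, same_task)
loss_opposite = task_consistency_loss(ref_in, ref_out, query_in, opposite_task)
```

Under this assumed formulation, the loss is near zero when the generated image moves the query embedding in the same direction as the reference pair, and grows toward 2 as the directions oppose each other. In practice the embeddings would come from a pretrained image encoder rather than hand-written lists.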