GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

ARC Lab, Tencent PCG · Institute of Automation, CAS

TL;DR

Recent works demonstrate the feasibility of enhancing visual representations with generative models: the generative model takes visual tokens as conditions and performs self-supervised reconstruction. However, the underlying principle remains underexplored.

In this work, we delve into three aspects to explore the critical factors: (1) conditioning mechanisms, (2) denoising configurations, and (3) generation paradigms.

We propose GenHancer, a two-stage post-training method that enhances CLIP ViT. It is efficient (requiring only lightweight denoisers) and versatile (applicable to both continuous and discrete denoisers).


▶ Our preliminary finding: perfect generation is not necessary for visual enhancements.
Gen4Rep Teaser

Perfect generation (reconstruction) does not always yield desirable visual representations. (a) Pipeline of fine-grained visual enhancements, where generative models take visual tokens as conditions and perform reconstruction. (b) Experiments across four dimensions, i.e., training iterations, denoiser size, ratio of local tokens as conditions, and whether to use pre-trained denoisers. We measure generation performance (CLIP score ↑) and visual representation performance (MMVP-VLM ↑). As the results show, although increasing the number of training iterations, adding more denoiser blocks, using a larger ratio of local tokens as conditions, and employing pre-trained denoisers all lead to better generation results, the performance of the visual representations does not always improve. Best viewed zoomed in.

Abstract

The synergy between generative and discriminative models has received growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored.

In this work, we empirically find that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore the critical factors, we delve into three aspects: (1) Conditioning mechanisms: We find that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observe that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method.

Through this in-depth exploration, we arrive at an effective method that consistently outperforms prior methods on the MMVP-VLM benchmark, e.g., by 6.0% on OpenAI CLIP. The enhanced CLIP can be plugged into multimodal large language models (MLLMs) for better vision-centric performance.

Method

Gen4Rep Method

The two-stage post-training framework for visual enhancements. (a) Overall training pipeline. (b) Continuous generative model as the denoiser. We employ a lightweight FLUX-like DiT (with fewer blocks) and use the regression loss of flow matching. (c) Discrete generative model as the denoiser. We choose a lightweight Perceiver and use a cross-entropy loss to predict masked tokens.
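
The two objectives referred to in the caption can be summarized in a short sketch. Below is a minimal PyTorch illustration, assuming a rectified-flow-style velocity target for the continuous denoiser and a standard masked-token cross-entropy for the discrete one; the exact parameterizations, schedules, and denoiser interfaces (`denoiser`, `cond`) are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(denoiser, x, cond):
    """Regression loss of flow matching for the continuous denoiser.

    Assumes a rectified-flow-style linear interpolation and velocity target;
    the actual parameterization and sign convention may differ.
    """
    noise = torch.randn_like(x)                    # x: clean targets (e.g., image latents)
    t = torch.rand(x.size(0), device=x.device)     # one timestep per sample in [0, 1]
    t_ = t.view(-1, *([1] * (x.dim() - 1)))        # broadcast t over spatial dims
    x_t = (1.0 - t_) * x + t_ * noise              # noisy interpolant
    target = noise - x                             # velocity target
    pred = denoiser(x_t, t, cond)                  # cond: projected CLIP visual condition
    return F.mse_loss(pred, target)

def masked_token_loss(denoiser, tokens, mask, cond):
    """Cross-entropy loss for the discrete denoiser: predict masked visual tokens.

    `tokens` are indices from a discrete visual tokenizer; the masking scheme
    and the Perceiver interface are assumptions.
    """
    logits = denoiser(tokens, mask, cond)          # (B, L, vocab_size)
    return F.cross_entropy(logits[mask], tokens[mask])
```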

Key Point #1: Conditional Visual Tokens. The visual condition should comprise only the global [CLS] token, which ensures considerable mutual information between the ViT and the generative model.
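
As a concrete illustration of this key point, a minimal sketch of [CLS]-only conditioning is given below. The CLIP ViT output layout (global [CLS] at index 0, followed by local patch tokens) and the projector widths are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CLSCondition(nn.Module):
    """Project only the global [CLS] token into the denoiser's condition space."""

    def __init__(self, clip_dim: int = 1024, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(clip_dim, cond_dim)   # illustrative widths

    def forward(self, clip_vit: nn.Module, images: torch.Tensor) -> torch.Tensor:
        tokens = clip_vit(images)    # (B, 1 + N, D): [CLS] followed by local patch tokens
        cls_only = tokens[:, :1]     # keep only [CLS]; local tokens are not used as conditions
        return self.proj(cls_only)   # (B, 1, cond_dim) condition for the denoiser
```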

Key Point #2: Denoising Configurations. Two-stage training helps the CLIP ViT learn useful knowledge from the generative model while shielding it from irrelevant information, e.g., the domain gap between the feature space and the condition space.
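
A rough sketch of the schedule is shown below, assuming that stage one trains only the projector and the lightweight denoiser with the CLIP ViT frozen, and that stage two then unfreezes the ViT; the exact parameter groups and step counts are assumptions, and `run_training` is a hypothetical helper standing in for the training loop.

```python
def two_stage_schedule(clip_vit, projector, denoiser, run_training,
                       stage1_steps: int, stage2_steps: int):
    """Two-stage post-training schedule (a sketch; parameter groups are assumptions)."""

    def set_trainable(module, flag: bool):
        for p in module.parameters():
            p.requires_grad_(flag)

    # Stage 1: adapt the projector and the lightweight denoiser to the *frozen*
    # ViT, so the gap between feature space and condition space is absorbed
    # here instead of leaking into the ViT.
    set_trainable(clip_vit, False)
    set_trainable(projector, True)
    set_trainable(denoiser, True)
    run_training(stage=1, steps=stage1_steps)

    # Stage 2: unfreeze the CLIP ViT so it can absorb fine-grained visual
    # knowledge from the already-adapted denoiser.
    set_trainable(clip_vit, True)
    run_training(stage=2, steps=stage2_steps)
```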

Key Point #3: Generation Paradigms. Our method is applicable to both continuous and discrete generative models.

Experiments

▶ Our method, with only lightweight denoisers, consistently outperforms prior methods that rely on heavy pre-trained denoisers:

MMVP main

▶ Our method is applicable to both continuous and discrete denoisers:

MMVP continuous/discrete

▶ The enhanced CLIP ViT can be plugged into MLLMs in a plug-and-play manner to improve their performance on vision-centric benchmarks:

MLLM

▶ Qualitative analysis: perfect generation does not always yield better visual representations:

Qualitative analysis

Algorithms

▶ The detailed algorithm with continuous denoisers is shown below:
Continuous denoiser
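
For readers who prefer code, below is a hedged sketch of one training step with the continuous denoiser, reusing the flow_matching_loss and CLSCondition sketches from the Method section above; the use of VAE latents as reconstruction targets and the optimizer handling are assumptions, not the paper's exact algorithm.

```python
import torch

def continuous_training_step(images, clip_vit, cond_module, denoiser, vae, optimizer):
    """One training step with the continuous (flow-matching) denoiser.

    A sketch only: reuses flow_matching_loss and CLSCondition from the Method
    section; the VAE-latent targets and optimizer setup are assumptions.
    """
    with torch.no_grad():
        targets = vae.encode(images)              # reconstruction targets (assumed latent space)
    cond = cond_module(clip_vit, images)          # [CLS]-only condition (Key Point #1)
    loss = flow_matching_loss(denoiser, targets, cond)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()                              # updates only the parameters left trainable
    return loss.item()                            # by the current stage (Key Point #2)
```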

▶ The detailed algorithm with discrete denoisers is shown below:
Discrete denoiser
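
Analogously, below is a hedged sketch of one training step with the discrete denoiser, reusing the masked_token_loss and CLSCondition sketches from the Method section above; the visual tokenizer, random masking, and mask ratio are assumptions for illustration.

```python
import torch

def discrete_training_step(images, clip_vit, cond_module, denoiser, visual_tokenizer,
                           optimizer, mask_ratio: float = 0.75):
    """One training step with the discrete (masked-token) denoiser.

    A sketch only: reuses masked_token_loss and CLSCondition from the Method
    section; the tokenizer interface and mask ratio are assumptions.
    """
    with torch.no_grad():
        tokens = visual_tokenizer.encode(images)             # (B, L) discrete visual token ids
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    cond = cond_module(clip_vit, images)                     # [CLS]-only condition
    loss = masked_token_loss(denoiser, tokens, mask, cond)   # cross-entropy on masked positions
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```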

BibTeX

@article{ma2025genhancer,
  title={GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers},
  author={Ma, Shijie and Ge, Yuying and Wang, Teng and Guo, Yuxin and Ge, Yixiao and Shan, Ying},
  journal={arXiv preprint arXiv:2503.19480},
  year={2025}
}

Contact

If you have further questions, feel free to contact me: mashijie9817@gmail.com

Discussions and potential collaborations are also welcome.