GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Real-world LQ–HQ pairs from MFMs to expand IR generalization boundaries.

Xiangtao Kong*, Jixin Zhao*, Lingchen Sun, Rongyuan Wu, Lei Zhang

The Hong Kong Polytechnic University · OPPO Research Institute

* Equal contribution · Corresponding author

Demo. LQ-GT pairs from GGT-100K.
Overview. GGT-100K improves generalization of diverse IR models via MFM-generated paired supervision.
Generalization. Training with GGT-100K yields stronger real-world restoration on rain, haze, and mixed degradations.

Key Takeaways

Generative Ground Truth

MFMs synthesize HQ targets from real LQ images—scalable paired data without physical capture.

103K Real-World Pairs

GGT-100K covers six degradation categories, plus a curated 500-pair test set. Each category contains complex mixed degradations.

Rigorous Pipeline

9 MFMs evaluated → Nano-Banana-2 selected → metric + VLM + manual quality control.

Broad Model Gains

Consistent improvements on 10 baselines; largest gains on generative IR models.

Abstract

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT) by using generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. The results demonstrate that Nano-Banana-2 with VLM-based adaptive prompting shows the highest capability to synthesize perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then employ Nano-Banana-2 to build a GGT synthesis pipeline with multi-stage quality control and construct GGT-100K, an LQ-HQ paired dataset comprising 103,707 training pairs and covering diverse scenes and complex real-world degradations. A test set of 500 image pairs is also established. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits for finetuning generative models for IR tasks.

Keywords: Generalizable image restoration · Generative ground truth · Multimodal foundation models

GGT-100K Dataset

A large-scale real-world LQ–HQ dataset built with generative MFMs, designed as a complementary source to existing data—not a replacement.

Train pairs: 103,707
Test pairs: 500
Resolution: 1024²
Categories: General Mixed · Low-Light · Haze · Rain · Snow · Old Photo

These categories are not isolated single-degradation settings; each category contains complex mixed degradations. For example, rain images usually include rain together with blur, noise, and compression artifacts. Test pairs are manually verified for fidelity and minimal hallucination.

Release

License: CC BY-NC-ND 4.0

GGT Construction

Construction pipeline of GGT-100K.

Source Image Collection

Real-world LQ images without HQ references, from three sources:

  • Existing datasets (RESIDE, Snow100K, etc.)
  • Internet (CC0-licensed crawling)
  • Our own captures (multi-device, diverse scenes)

Systematic Evaluation of MFMs

9 MFMs × fixed / adaptive prompts, evaluated on fidelity (DIV2K-Val), perceptual quality (200 real-world LQ images), VLM-R, and human preference.

Selected: Nano-Banana-2 + Gemini adaptive prompting (best average score 0.84, human preference 32.5%).
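One way to combine several metrics into a single average score per configuration is sketched below. It is only an illustration: the page does not say how the reported average is computed, so the min-max normalization, the function name `average_score`, and the metric names in the usage note are all assumptions.

```python
from typing import Dict, FrozenSet


def average_score(
    table: Dict[str, Dict[str, float]],
    lower_is_better: FrozenSet[str] = frozenset(),
) -> Dict[str, float]:
    """Min-max normalize each metric across candidates, then average.

    Assumes every candidate reports every metric. Metrics listed in
    `lower_is_better` are flipped so that 1.0 is always the best value.
    """
    metrics = sorted({m for row in table.values() for m in row})
    result = {name: 0.0 for name in table}
    for metric in metrics:
        values = [row[metric] for row in table.values()]
        low, high = min(values), max(values)
        span = high - low
        for name, row in table.items():
            # A metric on which all candidates tie contributes a neutral 0.5.
            norm = 0.5 if span == 0 else (row[metric] - low) / span
            if metric in lower_is_better:
                norm = 1.0 - norm
            result[name] += norm / len(metrics)
    return result
```

For example, calling `average_score(scores, lower_is_better=frozenset({"perceptual_distance"}))` on a table of placeholder metrics returns one value in [0, 1] per configuration, and `max(result, key=result.get)` picks the top-ranked one.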

MFM and prompting comparison.

Multi-stage Quality Control

  1. Metric filtering — drop pairs with no perceptual gain.
  2. VLM refinement — regenerate failed samples with feedback prompts.
  3. Manual verification — remove artifacts and content drift.
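The first two stages above can be sketched as a filter-and-retry loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `Pair`, `quality_control`, `score`, and `regenerate` are hypothetical names, `score` stands in for whatever perceptual metric is used, and `regenerate` stands in for the VLM feedback step. Whether the regenerated samples are exactly those that fail the metric filter is also an assumption. Manual verification (stage 3) happens outside the code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Pair:
    lq: str  # identifier or path of the real low-quality input
    hq: str  # identifier or path of the generated target


def quality_control(
    pairs: Iterable[Pair],
    score: Callable[[str], float],                 # hypothetical quality metric, higher is better
    regenerate: Callable[[Pair], Optional[Pair]],  # hypothetical VLM-guided retry
    max_retries: int = 2,
) -> List[Pair]:
    """Keep pairs whose target scores above its input; retry the rest."""
    accepted: List[Pair] = []
    for pair in pairs:
        candidate = pair
        for attempt in range(max_retries + 1):
            # Stage 1: metric filtering, require a perceptual gain.
            if score(candidate.hq) > score(candidate.lq):
                accepted.append(candidate)
                break
            if attempt == max_retries:
                break  # retries exhausted, drop the pair
            # Stage 2: ask for a new target using feedback.
            retried = regenerate(candidate)
            if retried is None:
                break  # no new target available, drop the pair
            candidate = retried
    return accepted
```

The accepted list would then go to human reviewers, who remove pairs with artifacts or content drift.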

Validating GGT Effectiveness

10 models, each trained without vs. with GGT-100K (GGT-100K sampled 1:1 with 200K existing pairs).
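A 1:1 sampling ratio between two training sources of different sizes can be sketched as follows. This reads "1:1" as drawing each training sample from either source with equal probability, which is an assumption; `mixed_sampler` and the source labels are hypothetical names, not part of any released code.

```python
import random
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def mixed_sampler(
    ggt: Sequence[T],
    existing: Sequence[T],
    seed: int = 0,
) -> Iterator[Tuple[str, T]]:
    """Yield (source, item) forever, picking each source with probability 0.5.

    The ratio holds in expectation regardless of how many items each
    source contains, so the smaller source is simply revisited more often.
    """
    rng = random.Random(seed)
    while True:
        if rng.random() < 0.5:
            yield "ggt", rng.choice(ggt)
        else:
            yield "existing", rng.choice(existing)
```

A training loop would call `next()` on the sampler once per sample (or batch element), so roughly half of each epoch comes from GGT-100K and half from the existing pairs.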

GGT-100K-500 Test Set

GGT-100K improves fidelity, perceptual metrics, and VLM-R across all models; generative models benefit most.

Quantitative comparison w/o and w/ GGT-100K.
Visual comparisons on real-world degradations.

Public RealLQ Benchmarks

Similar trends on RealDeg and OpenReal80K—consistent AFINE-NR and VLM-R gains, especially for generative IR.

Results on Public RealLQ Benchmarks.

Citation

@article{kong2026GGT-100K,
  title={GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration},
  author={Kong, Xiangtao and Zhao, Jixin and Sun, Lingchen and Wu, Rongyuan and Zhang, Lei},
  journal={arXiv preprint arXiv:2605.31039},
  year={2026}
}