GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Real-world LQ–HQ pairs from MFMs to expand IR generalization boundaries.

Xiangtao Kong^*, Jixin Zhao^*, Lingchen Sun, Rongyuan Wu, Lei Zhang^†

The Hong Kong Polytechnic University · OPPO Research Institute

^* Equal contribution · ^† Corresponding author

Paper · GitHub · Dataset (HuggingFace) · Dataset (BaiduDisk)

Figure: Overview. GGT-100K improves the generalization of diverse IR models via MFM-generated paired supervision.

Figure: Generalization. Training with GGT-100K yields stronger real-world restoration on rain, haze, and mixed degradations.

Highlights

Key Takeaways

Generative Ground Truth

MFMs synthesize HQ targets from real LQ images—scalable paired data without physical capture.

103K Real-World Pairs

GGT-100K covers six degradation categories, plus a curated 500-pair test set. Each category contains complex mixed degradations.

Rigorous Pipeline

9 MFMs evaluated → Nano-Banana-2 selected → metric + VLM + manual quality control.

Broad Model Gains

Consistent improvements on 10 baselines; largest gains on generative IR models.

Abstract

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT) by using generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. The results demonstrate that Nano-Banana-2 with VLM-based adaptive prompting shows the highest capability to synthesize perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then employ Nano-Banana-2 to build a GGT synthesis pipeline with multi-stage quality control and construct GGT-100K, an LQ-HQ paired dataset comprising 103,707 training pairs and covering diverse scenes and complex real-world degradations. A test set of 500 image pairs is also established. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits for finetuning generative models for IR tasks.

Keywords: Generalizable image restoration · Generative ground truth · Multimodal foundation models

Dataset

GGT-100K Dataset

A large-scale real-world LQ–HQ dataset built with generative MFMs, designed as a complementary source to existing data—not a replacement.

103,707 train pairs

500 test pairs

1024² resolution

General Mixed · Low-Light · Haze · Rain · Snow · Old Photo

These categories are not isolated single-degradation settings; each category contains complex mixed degradations. For example, rain images usually include rain together with blur, noise, and compression artifacts. Test pairs are manually verified for fidelity and minimal hallucination.

Release

GGT-100K — paired LQ/HQ images
existing-dataset — data used in our training recipe
pretrained-models — 10 methods × 2 settings (20 checkpoints)

License: CC BY-NC-ND 4.0

Method

GGT Construction

Source Image Collection

Real-world LQ images without HQ references, from three sources:

Existing datasets (RESIDE, Snow100K, etc.)
Internet (CC0-licensed crawling)
Our own captures (multi-device, diverse scenes)

Systematic Evaluation of MFMs

Nine MFMs are each tested with fixed and adaptive prompts. They are evaluated on fidelity (DIV2K-Val), perceptual quality (200 real LQ images), VLM-R, and human preference.

Selected: Nano-Banana-2 with Gemini-based adaptive prompting (best average score 0.84, human preference 32.5%).
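
As a rough illustration of how a model and prompting mode could be chosen from such a grid, the sketch below scores every (model, prompt mode) combination and ranks them by their mean score. It is not the authors' evaluation code: the function name `rank_configurations`, the criterion names, the placeholder models `model_a` and `model_b`, and all numbers in `TOY_SCORES` are assumptions made up for the example, and averaging already-normalized scores is only one plausible way to aggregate.

```python
from itertools import product
from statistics import mean


def rank_configurations(models, prompt_modes, score_fn):
    """Score each (model, prompt mode) pair and sort best-first by mean score.

    score_fn(model, mode) is assumed to return a dict mapping a criterion
    name to a score already normalized to [0, 1] (higher is better).
    """
    ranked = []
    for model, mode in product(models, prompt_modes):
        scores = score_fn(model, mode)
        ranked.append((mean(scores.values()), model, mode))
    ranked.sort(reverse=True)
    return ranked


# Toy placeholder scores (NOT results from the paper).
TOY_SCORES = {
    ("model_a", "fixed"): {"fidelity": 0.60, "perceptual": 0.70},
    ("model_a", "adaptive"): {"fidelity": 0.70, "perceptual": 0.80},
    ("model_b", "fixed"): {"fidelity": 0.50, "perceptual": 0.60},
    ("model_b", "adaptive"): {"fidelity": 0.55, "perceptual": 0.65},
}

ranking = rank_configurations(
    ["model_a", "model_b"],
    ["fixed", "adaptive"],
    lambda model, mode: TOY_SCORES[(model, mode)],
)
best_avg, best_model, best_mode = ranking[0]
```

In practice a human preference study would sit alongside such an automatic ranking rather than be folded into a single average.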

Figure: MFM and prompting comparison.

Multi-stage Quality Control

Metric filtering — drop pairs with no perceptual gain.
VLM refinement — regenerate failed samples with feedback prompts.
Manual verification — remove artifacts and content drift.
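
The three stages above can be sketched as a simple filter-and-retry loop. This is a minimal sketch under stated assumptions, not the released pipeline: `Pair`, `quality_control`, and the callables `quality`, `vlm_feedback`, and `regenerate` are hypothetical names, "no perceptual gain" is approximated as the HQ score not exceeding the LQ score, and the retry limit is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Pair:
    lq: str  # identifier or path of the real low-quality image
    hq: str  # identifier or path of the generated high-quality target


def quality_control(
    pairs: List[Pair],
    quality: Callable[[str], float],               # hypothetical perceptual score, higher is better
    vlm_feedback: Callable[[Pair], Optional[str]],  # hypothetical: None if OK, else feedback text
    regenerate: Callable[[str, str], str],          # hypothetical: (lq, feedback) -> new hq
    max_retries: int = 2,                           # assumed retry budget
) -> List[Pair]:
    """Return pairs that pass the automatic stages and await manual review."""
    accepted = []
    for pair in pairs:
        # Stage 1: metric filtering. Drop pairs with no perceptual gain.
        if quality(pair.hq) <= quality(pair.lq):
            continue
        # Stage 2: VLM refinement. Regenerate flagged samples with feedback.
        feedback = vlm_feedback(pair)
        retries = 0
        while feedback is not None and retries < max_retries:
            pair = Pair(pair.lq, regenerate(pair.lq, feedback))
            feedback = vlm_feedback(pair)
            retries += 1
        if feedback is None:
            accepted.append(pair)
    # Stage 3: manual verification is a human step applied to `accepted`.
    return accepted
```

A stricter variant could re-run the metric filter after each regeneration; the sketch omits that to keep the control flow easy to follow.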

Experiments

Validating GGT Effectiveness

Ten models are each trained without and with GGT-100K (1:1 sampling with 200K existing pairs).
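
One way to read the 1:1 sampling is that each training batch draws equally from GGT-100K and from the existing pairs, regardless of the two sets' sizes. The sketch below illustrates that reading only; the function name `mixed_batches`, sampling with replacement, and the fixed seed are assumptions, and the actual training recipe may differ.

```python
import random


def mixed_batches(ggt_pairs, existing_pairs, batch_size, num_batches, seed=0):
    """Yield batches with half the samples from each source (a 1:1 mix).

    Sampling is with replacement, so the smaller set is simply revisited
    more often than the larger one.
    """
    if batch_size % 2:
        raise ValueError("batch_size must be even for a 1:1 split")
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        batch = rng.choices(ggt_pairs, k=half) + rng.choices(existing_pairs, k=half)
        rng.shuffle(batch)  # avoid a fixed source order inside the batch
        yield batch
```

For example, `list(mixed_batches(ggt, existing, batch_size=8, num_batches=3))` returns three batches, each containing four items from `ggt` and four from `existing`.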

GGT-100K-500 Test Set

GGT-100K improves fidelity, perceptual metrics, and VLM-R across all models; generative models benefit most.

Quantitative results: comparison without and with GGT-100K.

Figures: additional visual comparisons on real-world degradations.

Public RealLQ Benchmarks

Similar trends on RealDeg and OpenReal80K—consistent AFINE-NR and VLM-R gains, especially for generative IR.

Table: Results on public RealLQ benchmarks.

BibTeX

Citation

@article{kong2026GGT-100K,
  title={GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration},
  author={Kong, Xiangtao and Zhao, Jixin and Sun, Lingchen and Wu, Rongyuan and Zhang, Lei},
  journal={arXiv preprint arXiv:2605.31039},
  year={2026}
}