- June 26, 2026: 🏅 GRN 8B is out now! A single unified model covering T2V, I2V and T2I. Its performance rivals Wan 2.1 14B. AR models never surrender!
- June 18, 2026: 🍾 GRN has been accepted to ECCV 2026.
- June 8, 2026: ✈️ The training & fine-tuning code for GRN-T2I and GRN-T2V is released.
- June 3, 2026: 🍉 A toy image-video dataset is provided for GRN-T2I/GRN-T2V training and fine-tuning.
- May 23, 2026: 🌺 We release the training and evaluation code for HBQ tokenizer, enjoy~
- April 14, 2026: 🤗 Paper and code release
- 🌟 Introduction
- ✨ Gallery
- 🚀 Demo
- 📦 Model Zoo
- 🛠️ Installation
- 📦 HBQ Tokenizer
- 🖼️ Class-to-Image
- 🎨 Text-to-Image
- 🎬 Text-to-Video
- 🎬 Image-to-Video
- 📧 Contact
- 🤗 Acknowledgements
- 📝 Citation
This is the official implementation of the paper Generative Refinement Networks for Visual Synthesis. Neither diffusion nor autoregressive — GRN is a third way. 🧠 Refines globally like an artist. ⚡ Generates adaptively by complexity. 🏆 New SOTA across image & video. The visual generation paradigm just got rewritten.
Diffusion models dominate visual generation, but they allocate uniform computational effort to samples of varying complexity. Autoregressive (AR) models are complexity-aware, as evidenced by their variable likelihoods, but suffer from lossy tokenization and error accumulation.
We introduce Generative Refinement Networks (GRN), a new visual synthesis paradigm that addresses these issues:
- Near-lossless tokenization via Hierarchical Binary Quantization (HBQ)
- Global refinement mechanism that progressively perfects outputs like a human artist
- Entropy-guided sampling for complexity-aware, adaptive-step generation
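To give a feel for how a hierarchical binary code can be near-lossless, here is a minimal sketch. It is an assumption, not the paper's definition of HBQ: it treats a latent value in [-1, 1] as a sum of signed, halving steps, so each extra round adds one bit and halves the worst-case error. The function names `hbq_encode` and `hbq_decode` are hypothetical and do not come from this repository.

```python
def hbq_encode(x, rounds=4):
    """Encode a scalar in [-1, 1] as `rounds` sign bits (+1 or -1).

    Illustrative assumption only: round i contributes bit * 2**-i,
    and the residual left after each round is encoded by the next one.
    """
    bits = []
    residual = x
    for level in range(1, rounds + 1):
        bit = 1 if residual >= 0 else -1
        bits.append(bit)
        residual -= bit * 2.0 ** (-level)
    return bits


def hbq_decode(bits):
    """Rebuild the approximation from the sign bits."""
    return sum(bit * 2.0 ** (-(i + 1)) for i, bit in enumerate(bits))


if __name__ == "__main__":
    code = hbq_encode(0.3, rounds=4)
    print(code, hbq_decode(code))
```

Under this reading, the reconstruction error after K rounds is at most 2^-K, which is one way the "round_4" suffix in the tokenizer config could be interpreted.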
GRN achieves state-of-the-art results on ImageNet reconstruction and class-conditional generation, and scales effectively to text-to-image and text-to-video tasks.
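The entropy-guided sampling mentioned above can be pictured with a small sketch. This is a hedged illustration, not the repository's implementation: it computes the Shannon entropy of a predicted token distribution and maps the normalized value linearly onto a step budget. `token_entropy` and `steps_from_entropy` are hypothetical helpers, and the linear mapping is an assumption; only the default bounds of 10 and 50 echo the `complexity_aware_Tmin` and `complexity_aware_Tmax` values used in the inference examples.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def steps_from_entropy(mean_entropy, max_entropy, t_min=10, t_max=50):
    """Map entropy, normalized to [0, 1], linearly onto [t_min, t_max].

    Assumed rule for illustration: low-entropy (simple) samples get
    few refinement steps, high-entropy (complex) samples get many.
    """
    ratio = min(max(mean_entropy / max_entropy, 0.0), 1.0)
    return round(t_min + ratio * (t_max - t_min))


if __name__ == "__main__":
    uniform = [0.25] * 4
    peaked = [0.97, 0.01, 0.01, 0.01]
    max_h = math.log(4)
    print(steps_from_entropy(token_entropy(uniform), max_h))
    print(steps_from_entropy(token_entropy(peaked), max_h))
```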
Generative Refinement Framework
Starting from a random token map, GRN randomly selects more predictions at each step and refines all input tokens. For example, compared to the second step, the third step filled six new tokens (pink), kept two tokens (blue), erased two tokens (yellow), and left six tokens blank (gray).
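The step described above can be mimicked with a toy loop. This is a simplified sketch under stated assumptions, not the GRN model: `toy_predict` is a stand-in for the network, blank positions are marked with `None` for readability, and the number of kept predictions grows linearly with the step index. Because a fresh random subset is drawn at every step, a position kept at one step may be erased at the next, which matches the filled / kept / erased / blank categories in the figure.

```python
import random


def refine(predict, num_tokens, num_steps, seed=0):
    """Toy refinement loop: re-predict every token, keep a growing random subset."""
    rng = random.Random(seed)
    state = [None] * num_tokens  # None marks a blank position
    history = []
    for step in range(1, num_steps + 1):
        predictions = predict(state)              # refine all positions
        keep = num_tokens * step // num_steps     # keep more at each step
        selected = set(rng.sample(range(num_tokens), keep))
        state = [predictions[i] if i in selected else None
                 for i in range(num_tokens)]
        history.append(state)
    return state, history


def toy_predict(state):
    """Hypothetical stand-in for the model: predicts i % 4 at position i."""
    return [i % 4 for i in range(len(state))]


if __name__ == "__main__":
    final, history = refine(toy_predict, num_tokens=16, num_steps=4)
    for step, tokens in enumerate(history, start=1):
        filled = sum(t is not None for t in tokens)
        print(f"step {step}: {filled} of {len(tokens)} tokens filled")
```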
01.mp4 | 02.mp4 | 03.mp4 | 04.mp4 | 05.mp4 | 06.mp4 | 07.mp4 | 08.mp4 | 09.mp4
01.mp4 | 02.mp4 | 03.mp4 | 04.mp4 | 05.mp4 | 06.mp4 | 07.mp4 | 08.mp4 | 09.mp4
Try our interactive Text-to-Image demo on 🤗 Hugging Face Space:
Experience the power of Generative Refinement Networks firsthand by generating images from text prompts directly in your browser!
Try our interactive Text-to-Video demo on Discord:
T2V Demo on Discord

| Model | Checkpoints |
|---|---|
| Tokenizers | ✅ ImageNet Tokenizer ✅ Joint Image/Video Tokenizer |
| GRN_ind_C2I | ✅ B ⬜ L (TBD) ⬜ H (TBD) ⬜ G (TBD) |
| GRN_bit_T2I | ✅ GRN_T2I |
| GRN_bit_T2V | ✅ GRN_T2V |
git clone https://github.com/bytedance/GRN
cd GRN

A suitable conda environment named GRN can be created and activated with:

conda create -n GRN python=3.11
conda activate GRN
pip install -r requirements.txt

If you get undefined symbol: iJIT_NotifyEvent when importing torch, reinstall torch:

pip uninstall torch
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124

Check this issue for more details.
Image Dataset, e.g., data_root/username/labels/imagenet/train.txt:
[image_1_full_path]
[image_2_full_path]
[image_3_full_path]
...
Video Dataset, e.g., data_root/username/labels_hanjian/high-quality-video/horizontal_videos.txt:
[video_1_full_path]
[video_2_full_path]
[video_3_full_path]
...
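The list files above are plain text with one absolute path per line. A minimal sketch for producing such a file from a directory is shown here; `write_file_list` is a hypothetical helper, not part of the repository, and the extension sets in the usage note are assumptions you should adapt to your data.

```python
from pathlib import Path


def write_file_list(data_dir, list_path, extensions):
    """Write one absolute path per line for every matching file under data_dir.

    Returns the number of paths written.
    """
    exts = {e.lower() for e in extensions}
    paths = sorted(
        str(p.resolve())
        for p in Path(data_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in exts
    )
    list_path = Path(list_path)
    list_path.parent.mkdir(parents=True, exist_ok=True)
    list_path.write_text("\n".join(paths) + ("\n" if paths else ""),
                         encoding="utf-8")
    return len(paths)
```

For example, `write_file_list("/path/to/images", "train.txt", {".jpg", ".jpeg", ".png"})` would build an image list, and passing `{".mp4"}` would build a video list.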
For example, set latent_channels=16/64 and quant_method=hierarchical_binary_quant_round_4 in scripts/hbq_tokenizer_train.sh, then run:
cd grn/tokenizer
bash scripts/hbq_tokenizer_train.sh

For example, set latent_channels=16/64 and quant_method=hierarchical_binary_quant_round_4 in scripts/hbq_tokenizer_train.sh, then run:

cd grn/tokenizer
bash scripts/hbq_tokenizer_eval.sh

Download the ImageNet dataset and place it in your IMAGENET_PATH.
All training scripts are located in scripts/c2i/. We suggest using 8x80GB GPUs for most models.
| Model | Training Script | GPUs Required |
|---|---|---|
| GRN_ind_B | bash scripts/c2i/train_GRN_ind_B.sh | 8x80GB |
| GRN_bit_B | bash scripts/c2i/train_GRN_bit_B.sh | 8x80GB |
| GRN_ind_L | bash scripts/c2i/train_GRN_ind_L.sh | 8x80GB |
| GRN_ind_H | bash scripts/c2i/train_GRN_ind_H.sh | 16x80GB |
| GRN_ind_G | bash scripts/c2i/train_GRN_ind_G.sh | 32x80GB |
PyTorch pre-trained models are available here.
All evaluation scripts are located in scripts/c2i/. We suggest using 8x80GB GPUs.
| Model | Evaluation Script |
|---|---|
| GRN_ind_B | bash scripts/c2i/eval_GRN_ind_B.sh |
| GRN_bit_B | bash scripts/c2i/eval_GRN_bit_B.sh |
| GRN_ind_L | bash scripts/c2i/eval_GRN_ind_L.sh |
| GRN_ind_H | bash scripts/c2i/eval_GRN_ind_H.sh |
| GRN_ind_G | bash scripts/c2i/eval_GRN_ind_G.sh |
We use torch-fidelity to evaluate FID and IS against a reference image folder or pre-computed statistics. We use JiT's pre-computed reference stats under grn/utils_c2i/fid_stats.
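For context on why pre-computed statistics are sufficient as a reference: FID compares two Gaussians fitted to Inception features, so only the mean and covariance of the reference set are needed. The standard definition is given here, with subscripts g (generated) and r (reference) chosen for this note:

```latex
\mathrm{FID} = \lVert \mu_g - \mu_r \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_g + \Sigma_r - 2\,(\Sigma_g \Sigma_r)^{1/2} \right)
```

Here \mu and \Sigma are the feature mean and covariance of each set. Lower values indicate that the generated distribution is closer to the reference.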
Refer to data/toy_data/jsonls/000001/0001_0800_000000100.jsonl
{"image_path": "[image_path_1]", "long_caption": "xxx", "long_caption_type": "caption-InternVL2.0", "text": "", "short_caption_type": "blip2_caption", "width": 1080, "height": 1920}
{"image_path": "[image_path_2]", "long_caption": "xxx", "long_caption_type": "caption-InternVL2.0", "text": "", "short_caption_type": "blip2_caption", "width": 1080, "height": 1920}
...
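A record in this format can be written with a few lines of Python. The sketch below is illustrative: `make_image_record` and `write_jsonl` are hypothetical helpers, the field names are copied from the sample lines above, and treating `text` as the short caption is an assumption based on the adjacent `short_caption_type` field.

```python
import json


def make_image_record(image_path, long_caption, width, height, short_caption=""):
    """Build one record with the same keys as the sample JSONL lines."""
    return {
        "image_path": image_path,
        "long_caption": long_caption,
        "long_caption_type": "caption-InternVL2.0",
        "text": short_caption,  # assumed to hold the short caption
        "short_caption_type": "blip2_caption",
        "width": width,
        "height": height,
    }


def write_jsonl(records, out_path):
    """Write one JSON object per line, keeping non-ASCII captions readable."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```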
Run bash scripts/t2iv/train_GRN_bit_t2iv.sh
You can simply run python3 tools/t2i_infer.py or use the following code:
from PIL import Image
from tools.grn_pipeline import GRNPipeline

# Load pipeline
pipeline = GRNPipeline.from_pretrained(
    hf_repo_id='bytedance-research/GRN',
    task='T2I',
    pn='1M',
    model='GRN2b',
    device='cpu',
).to('cuda')

# Generate one image
result = pipeline(
    prompt="<T2I>" + "A cute cat playing in the garden",
    guidance_scale=3.0,
    temperature=1.1,
    complexity_aware_Tmin=10,
    complexity_aware_Tmax=50,
    complexity_aware_k=0,
    complexity_aware_b=50,
    complexity_aware_wp=5,
    snr_shift=1.,
    h_div_w=1.,
    content_type='image',
    seed=42,
)
image = result.images[0]
image.save('./generated_image.jpg')

Refer to data/toy_data/jsonls/000001/0001_0800_000000100.jsonl
{"video_path": "[video_path_1]", "begin_frame_id": xxx, "end_frame_id": xxx, "quality_prompt": "There is text in the video.", "fps": 25.0, "duration": 3.88, "width": 1280, "height": 720, "caption": [{"type": "short", "content": "[short_caption]"}, {"type": "medium", "content": "[medium_caption]"}, {"type": "long", "content": "[long_caption]"}]}
{"video_path": "[video_path_1]", "begin_frame_id": xxx, "end_frame_id": xxx, "quality_prompt": "The quality is very high!", "fps": 25.0, "duration": 3.88, "width": 1280, "height": 720, "caption": [{"type": "short", "content": "[short_caption]"}, {"type": "medium", "content": "[medium_caption]"}, {"type": "long", "content": "[long_caption]"}]}
...
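Each video record carries a list of captions tagged by length. A small sketch for reading such records and choosing a caption follows; `parse_jsonl` and `pick_caption` are hypothetical helpers, and the long-to-short fallback order is an assumption for illustration rather than the training code's behavior.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def pick_caption(record, preferred=("long", "medium", "short")):
    """Return the first available caption in the preferred order, else None."""
    by_type = {c["type"]: c["content"] for c in record.get("caption", [])}
    for caption_type in preferred:
        if caption_type in by_type:
            return by_type[caption_type]
    return None
```

With a file on disk, `parse_jsonl(open(path, encoding="utf-8"))` returns the records, and `pick_caption(record, preferred=("short",))` selects only the short caption.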
Run bash scripts/t2iv/train_GRN_bit_t2iv.sh
You can simply run python3 tools/t2v_infer.py or use the following code:
from tools.grn_pipeline import GRNPipeline

# Load pipeline
pipeline = GRNPipeline.from_pretrained(
    hf_repo_id='bytedance-research/GRN',
    task='T2V',
    pn='0.41M',
    model='GRN2b',  # 'GRN2b' or 'GRN8b'
    device='cpu',
).to('cuda')

# Generate one video
result = pipeline(
    prompt="Two women demonstrate a makeup product, applying it with a sponge while smiling and engaging with the camera in a bright, clean setting.",
    guidance_scale=4.0,
    temperature=1.0,
    complexity_aware_Tmin=10,
    complexity_aware_Tmax=50,
    complexity_aware_k=0,
    complexity_aware_b=50,
    complexity_aware_wp=5,
    snr_shift=1.,
    h_div_w=9/16,
    duration=2.,
    content_type='video',
    seed=42,
)
video_file = result.videos[0]

Refer to data/toy_data/jsonls/000001/0001_0800_000000100.jsonl
{"video_path": "[video_path_1]", "begin_frame_id": xxx, "end_frame_id": xxx, "quality_prompt": "There is text in the video.", "fps": 25.0, "duration": 3.88, "width": 1280, "height": 720, "caption": [{"type": "short", "content": "[short_caption]"}, {"type": "medium", "content": "[medium_caption]"}, {"type": "long", "content": "[long_caption]"}]}
{"video_path": "[video_path_1]", "begin_frame_id": xxx, "end_frame_id": xxx, "quality_prompt": "The quality is very high!", "fps": 25.0, "duration": 3.88, "width": 1280, "height": 720, "caption": [{"type": "short", "content": "[short_caption]"}, {"type": "medium", "content": "[medium_caption]"}, {"type": "long", "content": "[long_caption]"}]}
...
Run bash scripts/t2iv/train_GRN_bit_t2iv.sh
You can simply run python3 tools/i2v_infer.py or use the following code:
from tools.grn_pipeline import GRNPipeline

# Load pipeline
pipeline = GRNPipeline.from_pretrained(
    hf_repo_id='bytedance-research/GRN',
    task='T2V',
    pn='0.41M',
    model='GRN8b',
    device='cpu',
).to('cuda')

# Generate one video, conditioned on the first frame.
# The prompt is in Chinese. English gloss: a red convertible sports car drives
# along a palm-lined city road at dusk while a white SUV passes quickly from behind.
result = pipeline(
    prompt="<I2V>视频展示了一辆红色敞篷跑车在城市道路中行驶的连续画面。车辆以中等速度前进,车身光滑,反射着黄昏的暖光,黑色轮毂与红色车漆形成对比。驾驶员为男性,专注地操控方向盘,姿态放松。道路两侧排列着高大的棕榈树,背景中可见石质围栏和模糊的建筑轮廓。随着视频推进,一辆白色SUV从后方快速驶过,产生动态模糊,突显跑车的稳定行驶。镜头保持相对固定的侧前方视角,轻微跟随车辆移动,捕捉车身线条与光影变化。整体画面色调温暖,光线柔和,营造出一种优雅而动感的都市驾驶氛围. high aesthetic and high quality video.",
    guidance_scale=4.0,
    temperature=1.0,
    complexity_aware_Tmin=10,
    complexity_aware_Tmax=50,
    complexity_aware_k=0,
    complexity_aware_b=50,
    complexity_aware_wp=5,
    snr_shift=1.,
    h_div_w=9/16,
    duration=2.,
    content_type='video',
    seed=42,
    first_frame_condition=True,
    first_frame_path='./assets/i2v_example.jpg',
)
video_file = result.videos[0]

If you are interested in scaling GRN for image generation / image editing / video generation / video editing / unified model directions, please feel free to reach out!
📧 Email: hanjian.thu123@bytedance.com
- Thanks to JiT, Infinity and InfinityStar for their wonderful work and codebase!
If you find our work useful, please consider citing:
@misc{han2026grn,
title={Generative Refinement Networks for Visual Synthesis},
author={Jian Han and Jinlai Liu and Jiahuan Wang and Bingyue Peng and Zehuan Yuan},
year={2026},
eprint={2604.13030},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.13030},
}


