SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
+
+ Online GRPO with geometry-aware reward steering in CLIP/HPSv2 space for safer
+ diffusion models without paired safe/unsafe image supervision or reward model fine-tuning.
+
+
+
+
+
+
+
+
+ Teaser figure: online reward steering improves the safety-utility trade-off during diffusion post-training.
+
+
+
+
+
TL;DR
+
+ SafeDiffusion-R1 unlearns unsafe visual concepts by steering the reward target,
+ not by filtering prompts or training a separate safety classifier. The model still sees
+ diverse prompts, but for unsafe prompts the reward is computed against a text embedding shifted along a safety direction.
+
+
+
+ 18.07%
+ Overall inappropriate content rate (Q16 on I2P), down from 48.9% for SD v1.4
+
+ 15
Lowest NudeNet detection count (unsafe-anchor variant), with the paper's noted trade-off in broader OOD safety.
+
+
+ 42.08% -> 47.83%
+ GenEval overall
+
Compositional generation improves when post-training with GenEval and nudity prompts.
+
+
+
+
+
+
+
+
Method
+
Reward steering in embedding space
+
+ SafeDiffusion-R1 keeps the original prompt as the model condition, but changes how unsafe prompts
+ are rewarded. It estimates a safety direction from safe and unsafe text anchors in HPSv2/CLIP
+ embedding space, then steers unsafe prompt embeddings toward that direction before computing
+ image-text reward.
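
As a rough illustration of the direction estimate, the sketch below takes the normalized difference between the mean safe-anchor and mean unsafe-anchor embeddings. The function names and the random arrays standing in for HPSv2/CLIP text embeddings are assumptions made for this example, not the released code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def safety_direction(safe_emb, unsafe_emb):
    """Unit vector pointing from the unsafe-anchor mean to the safe-anchor mean."""
    safe_mean = l2_normalize(safe_emb).mean(axis=0)
    unsafe_mean = l2_normalize(unsafe_emb).mean(axis=0)
    return l2_normalize(safe_mean - unsafe_mean)

# Placeholder embeddings: in practice these would come from the text encoder
# applied to safe and unsafe anchor phrases (hypothetical stand-ins here).
rng = np.random.default_rng(0)
safe_anchor_emb = rng.normal(size=(4, 8)) + 1.0
unsafe_anchor_emb = rng.normal(size=(4, 8)) - 1.0

v_safe = safety_direction(safe_anchor_emb, unsafe_anchor_emb)
print(v_safe.shape, np.linalg.norm(v_safe))
```

By construction, the safe-anchor mean projects higher onto `v_safe` than the unsafe-anchor mean does.
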
+
+
+
+
+
+
+
+
+
+ Safety is represented as a direction from unsafe anchors toward safe anchors.
+
+
+
+
+ The online GRPO loop samples multiple images per prompt, scores them with the steering reward,
+ normalizes advantages within each prompt group, and applies a clipped policy objective with KL
+ regularization. This turns unsafe prompt exposure into a safety-learning signal instead of a
+ reward for matching unsafe content.
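
The within-group normalization step can be sketched as follows. This is a generic GRPO-style advantage computation under assumed shapes (one row per prompt, one column per sampled image); the reward values are made-up placeholders, not numbers from the paper.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within each prompt group.

    rewards: array of shape (num_prompts, group_size), one row per prompt.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled images each (placeholder reward values).
rewards = np.array([[0.2, 0.4, 0.6, 0.8],
                    [5.0, 5.0, 5.0, 5.0]])
advantages = group_normalized_advantages(rewards)
print(advantages)
```

Because each row is centered on its own mean, a group whose samples all receive the same reward contributes zero advantage, so only relative differences among images of the same prompt drive the update.
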
+
+
+
No paired safe/unsafe image supervision required.
+
No separately fine-tuned safety reward model required.
+
Uses online policy samples rather than static offline generations.
+
Steering strength is set to alpha = 0.5 in the main experiments.
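
A minimal sketch of how a steered reward target might be computed with alpha = 0.5. The additive-then-renormalize form, the `is_unsafe` flag, and the toy 3-D vectors are assumptions for illustration; the page states only that unsafe prompt embeddings are shifted toward the safety direction before the image-text reward is computed.

```python
import numpy as np

def unit(x):
    """Return x scaled to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x)

def steered_text_target(prompt_emb, v_safe, is_unsafe, alpha=0.5):
    """Text-side reward target: shifted toward v_safe only for unsafe prompts."""
    e = unit(prompt_emb)
    if not is_unsafe:
        return e
    return unit(e + alpha * unit(v_safe))

def steering_reward(image_emb, prompt_emb, v_safe, is_unsafe, alpha=0.5):
    """Cosine similarity between the image and the (possibly steered) text target."""
    target = steered_text_target(prompt_emb, v_safe, is_unsafe, alpha)
    return float(unit(image_emb) @ target)

# Toy vectors: axis 0 is the safety direction, axis 1 carries the scene content.
v_safe = np.array([1.0, 0.0, 0.0])
unsafe_prompt = np.array([-1.0, 1.0, 0.0])
safe_looking_image = np.array([0.0, 1.0, 0.0])
unsafe_looking_image = np.array([-1.0, 1.0, 0.0])

r_safe = steering_reward(safe_looking_image, unsafe_prompt, v_safe, is_unsafe=True)
r_unsafe = steering_reward(unsafe_looking_image, unsafe_prompt, v_safe, is_unsafe=True)
print(r_safe > r_unsafe)
```

In this toy setup the image that keeps the scene content but drops the unsafe component scores higher under the steered target, whereas with `is_unsafe=False` (no steering) the ordering reverses.
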
+
+
+
+
+
+
+ 01
+
Build a safety direction
+
Safe and unsafe anchor phrases are embedded with HPSv2/CLIP; their normalized mean difference defines v_safe.
+
+
+ 02
+
Steer reward targets
+
Unsafe prompt embeddings are shifted toward v_safe only for reward computation, while the diffusion model still receives the original prompt.
+
+
+ 03
+
Optimize on-policy
+
GRPO samples multiple images per prompt, normalizes rewards within each prompt group, and updates the policy with tight clipping plus KL control.
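
The policy update in the last step can be sketched as a clipped surrogate loss with a KL penalty toward a reference model. This is a generic GRPO-style objective over per-sample log-probabilities; `clip_eps`, `kl_coef`, and the particular KL estimator are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.1, kl_coef=0.01):
    """Clipped policy-gradient loss plus a KL penalty (lower is better).

    All inputs are 1-D arrays with one entry per sampled image.
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    logp_ref = np.asarray(logp_ref, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)

    # Importance ratio between the current policy and the sampling policy.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)

    # Non-negative per-sample estimate of KL(new || ref).
    log_r = logp_ref - logp_new
    kl = np.exp(log_r) - log_r - 1.0

    return float(-surrogate.mean() + kl_coef * kl.mean())

# A sample whose probability rose sharply only earns the clipped ratio.
loss = grpo_loss([1.0], [0.0], [1.0], [1.0], clip_eps=0.1, kl_coef=0.0)
print(loss)
```

A smaller `clip_eps` corresponds to tighter clipping: each update can move the policy less far from the one that generated the samples.
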
+ SafeDiffusion-R1 is evaluated on I2P safety metrics and GenEval compositional utility.
+ The main configuration gives the lowest overall inappropriate-content rate (18.07% with Q16), while the aggressive
+ unsafe-anchor variant reports the lowest NudeNet detection count (15, versus 31 for the main configuration).
+
+
+
+
+
+
Nudity detection on I2P with NudeNet threshold 0.6; lower is better.

| Method | Breast F | Genitalia F | Breast M | Genitalia M | Buttocks | Feet | Belly | Armpits | Total |
|---|---|---|---|---|---|---|---|---|---|
| SD v1.4 | 183 | 21 | 46 | 10 | 44 | 42 | 171 | 129 | 646 |
| DoCo | 162 | 29 | 48 | 63 | 64 | 122 | 168 | 250 | 906 |
| Ablating, CA | 298 | 22 | 67 | 7 | 45 | 66 | 180 | 153 | 838 |
| Safe-DPO SD2.1 | 88 | 13 | 19 | 2 | 14 | 54 | 110 | 125 | 425 |
| FMN | 155 | 17 | 19 | 2 | 12 | 59 | 117 | 43 | 424 |
| ESD-x | 101 | 6 | 16 | 10 | 12 | 37 | 77 | 53 | 312 |
| SLD-Med | 39 | 1 | 26 | 3 | 3 | 21 | 72 | 47 | 212 |
| UCE | 35 | 5 | 11 | 4 | 7 | 29 | 62 | 29 | 182 |
| SA | 39 | 9 | 4 | 0 | 15 | 32 | 49 | 15 | 163 |
| ESD-u | 14 | 1 | 8 | 5 | 5 | 24 | 31 | 33 | 121 |
| Receler | 13 | 1 | 12 | 9 | 5 | 10 | 26 | 39 | 115 |
| MACE | 16 | 0 | 9 | 7 | 2 | 39 | 19 | 17 | 109 |
| RECE | 8 | 0 | 6 | 4 | 0 | 8 | 23 | 17 | 66 |
| CPE, one word | 11 | 2 | 3 | 2 | 5 | 15 | 13 | 15 | 66 |
| CPE, four words | 6 | 1 | 3 | 2 | 2 | 8 | 8 | 10 | 40 |
| AdvUnlearn | 1 | 1 | 0 | 0 | 0 | 13 | 0 | 8 | 23 |
| SAeUron | 4 | 0 | 0 | 1 | 3 | 2 | 1 | 7 | 18 |
| SafeDiffusion-R1, main | 1 | 0 | 1 | 2 | 0 | 8 | 9 | 10 | 31 |
| SafeDiffusion-R1, unsafe-anchor variant | 3 | 0 | 0 | 0 | 0 | 4 | 3 | 5 | 15 |
+
+
+
+
+
+
+
OOD inappropriate content rate (%) on I2P with Q16; lower is better. NS means not supported.

| Method | Hate | Harassment | Violence | Self-harm | Sexual | Shocking | Illegal | Overall |
|---|---|---|---|---|---|---|---|---|
| SD v1.4 | 44.2 | 37.5 | 46.3 | 47.9 | 60.2 | 59.5 | 40.0 | 48.9 |
| EraseDiff | NS | NS | NS | 40.6 | 49.8 | 49.4 | NS | 44.9 |
| SPM | NS | NS | NS | 15.88 | 52.5 | 69.1 | NS | 54.6 |
| FMN | 37.7 | 25.0 | 47.8 | 46.8 | 59.1 | 58.1 | 37.0 | 47.8 |
| Ablating | 40.8 | 32.9 | 43.3 | 47.4 | 60.3 | 57.8 | 37.9 | 45.9 |
| ESD-x | 34.1 | 30.2 | 40.5 | 36.8 | 40.2 | 45.2 | 28.9 | 36.6 |
| SLD | 22.5 | 22.1 | 31.8 | 30.0 | 52.4 | 40.5 | 22.1 | 33.7 |
| ESD-u | 26.8 | 24.0 | 35.1 | 33.7 | 35.0 | 40.1 | 26.7 | 32.8 |
| UCE | 36.4 | 29.5 | 34.1 | 30.8 | 25.5 | 41.1 | 29.0 | 31.3 |
| Receler | 28.6 | 21.7 | 27.1 | 24.8 | 29.4 | 34.8 | 21.3 | 27.0 |
| CASTEER | 29.00 | 25.61 | 27.78 | 26.22 | 20.73 | 34.00 | 17.61 | 25.58 |
| Safe-DPO | NS | 22.59 | 32.43 | 33.33 | 20.7 | NS | 30.30 | 19.82 |
| SafeDiffusion-R1 | 16.02 | 25.12 | 17.33 | 15.86 | 11.60 | 14.60 | 26.00 | 18.07 |
| SafeDiffusion-R1, unsafe-anchor variant | 30.74 | 39.56 | 32.01 | 36.83 | 27.18 | 26.17 | 40.44 | 33.43 |
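
To make the trade-off between the two SafeDiffusion-R1 variants concrete, the short calculation below turns the reported totals into relative reductions versus SD v1.4. The numbers are copied from the NudeNet and Q16 tables on this page; the helper name and the dictionary keys are only for this example.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before

# Totals copied from the tables on this page.
nudenet_total = {"SD v1.4": 646, "R1 main": 31, "R1 unsafe-anchor": 15}
q16_overall = {"SD v1.4": 48.9, "R1 main": 18.07, "R1 unsafe-anchor": 33.43}

summary = {}
for name in ("R1 main", "R1 unsafe-anchor"):
    summary[name] = {
        "nudenet": round(100 * relative_reduction(nudenet_total["SD v1.4"], nudenet_total[name]), 1),
        "q16": round(100 * relative_reduction(q16_overall["SD v1.4"], q16_overall[name]), 1),
    }
    print(name, summary[name])
```

The main configuration cuts NudeNet detections by about 95.2% and the overall Q16 rate by about 63.0%; the unsafe-anchor variant reaches about 97.7% on NudeNet but only about 31.6% on Q16.
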
+
+
+
+
+
+
+
+
+
Task-wise GenEval accuracy, higher is better.

| Task | SD1.4 | RECE | SD-Safe | R1, GenEval + nudity | R1, nudity only |
|---|---|---|---|---|---|
| single_object | 97.81% | 94.69% | 97.19% | 99.06% | 96.88% |
| two_object | 39.65% | 27.02% | 38.64% | 61.36% | 43.94% |
| counting | 31.56% | 29.69% | 34.38% | 30.00% | 35.00% |
| colors | 74.73% | 71.01% | 77.13% | 76.33% | 78.19% |
| position | 3.00% | 4.00% | 3.00% | 9.75% | 4.00% |
| color_attr | 5.75% | 3.75% | 5.00% | 10.50% | 6.75% |
| Overall | 42.08% | 38.36% | 42.55% | 47.83% | 44.12% |
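
The Overall row appears consistent with an unweighted mean of the six task accuracies. The check below recomputes that mean from the task-wise values copied from the table; that the benchmark aggregates this way is an inference from the numbers, not something the page states.

```python
# Task order: single_object, two_object, counting, colors, position, color_attr.
task_accuracy = {
    "SD1.4": [97.81, 39.65, 31.56, 74.73, 3.00, 5.75],
    "RECE": [94.69, 27.02, 29.69, 71.01, 4.00, 3.75],
    "SD-Safe": [97.19, 38.64, 34.38, 77.13, 3.00, 5.00],
    "R1, GenEval + nudity": [99.06, 61.36, 30.00, 76.33, 9.75, 10.50],
    "R1, nudity only": [96.88, 43.94, 35.00, 78.19, 4.00, 6.75],
}
reported_overall = {
    "SD1.4": 42.08,
    "RECE": 38.36,
    "SD-Safe": 42.55,
    "R1, GenEval + nudity": 47.83,
    "R1, nudity only": 44.12,
}

# Difference between the recomputed mean and the reported Overall value.
gaps = {}
for model, scores in task_accuracy.items():
    mean = sum(scores) / len(scores)
    gaps[model] = abs(mean - reported_overall[model])
    print(f"{model}: mean={mean:.2f} reported={reported_overall[model]:.2f}")
```

Every recomputed mean lands within 0.01 of the reported value, which fits rounding of the per-task numbers.
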
+
+
+
+
+
+
+
CLIP-T and FID for nudity-erased models.

| Model | CLIP-T | FID |
|---|---|---|
| Baseline SD1.4 | 0.313 | 37.35 |
| EraseDiff | 0.179 | 307.70 |
| ESD | 0.303 | 40.73 |
| FMN | 0.311 | 38.10 |
| SPM | 0.312 | 38.05 |
| UCE | 0.311 | 37.41 |
| SafeDiffusion-R1 | 0.311 | 52.28 |
| R1, negative anchor | 0.312 | 48.50 |
+
+
+
+
+
+
+
+
+
+
Ablation
+
Why steering reward is the stable choice
+
+ The paper studies scheduler choice, reward design, anchor construction, and steering strength.
+ The pattern is consistent: direct negative penalties suppress unsafe content but damage utility,
+ while geometric steering keeps the reward informative for both unsafe and benign prompts.
+
+
+
+
+
+ 0.002
+
Lowest MeanUnsafe
+
Steering reward reaches MeanUnsafe 0.002 while keeping CLIP-T at 28.74, a lower unsafe score than the SafeCLIP (0.246) and SafeCLIP + LLaVA-penalty (0.151) variants at comparable CLIP-T.
+
+
+ alpha = 0.5
+
Moderate steering
+
The default steering strength improves safety while preserving the gap between safe and unsafe prompt clusters.
+
+
+ 9 schedulers
+
Robust inference
+
With safety steering, multiple schedulers converge toward near-zero unsafe score by epoch 300.
+
+
+
+
+
+
+
Reward design ablation, lower MeanUnsafe is better.

| Reward | CLIP-T | MeanUnsafe |
|---|---|---|
| Base SD v1.4 | 27.07 | 0.990 |
| SafeCLIP, 7K positive | 28.76 | 0.246 |
| SafeCLIP + LLaVA penalty | 28.44 | 0.151 |
| -1 x CLIP, negative only | 23.31 | 0.018 |
| Steering reward | 28.74 | 0.002 |
+
+
+
+
+
+
+
+
+
+
Steering strength
+
Anchors move prompts toward safety without collapsing geometry
+
+ UMAP visualizations show that synonyms, keyword-minimal prompts, and negations are all pushed
+ toward the safe side as steering strength increases. The important behavior is not just a higher
+ safe score: the relative separation between safe and unsafe prompts is preserved.
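
One way to see why prompts move steadily toward the safe side is to assume the steered embedding takes the additive form $(e + \alpha v_{\text{safe}}) / \lVert e + \alpha v_{\text{safe}} \rVert$ with unit-norm $e$ and $v_{\text{safe}}$. This form is an assumption for illustration; the page says only that embeddings are shifted toward the safety direction. Under it, the cosine $s(\alpha)$ between the steered embedding and $v_{\text{safe}}$ never decreases as $\alpha$ grows:

```latex
\[
c = e^\top v_{\text{safe}}, \qquad
s(\alpha)
  = \frac{(e + \alpha v_{\text{safe}})^\top v_{\text{safe}}}{\lVert e + \alpha v_{\text{safe}} \rVert}
  = \frac{c + \alpha}{\sqrt{1 + 2\alpha c + \alpha^{2}}},
\qquad
\frac{ds}{d\alpha}
  = \frac{1 - c^{2}}{\left(1 + 2\alpha c + \alpha^{2}\right)^{3/2}} \;\ge\; 0 .
\]
```

The derivative is strictly positive whenever $\lvert c \rvert < 1$, so any prompt not already aligned with $\pm v_{\text{safe}}$ moves monotonically toward the safe side as $\alpha$ increases.
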
+
+
+
+
+
+
+
+
+ Prompt steering remains consistent across synonyms, minimal keywords, and negation.
+
+
+
+
+
+
+
Reward design
+
Negative-only reward is safe but not useful
+
+ A pure negative CLIP penalty drives the unsafe score down (MeanUnsafe 0.018), but at the cost of utility:
+ CLIP-T drops to 23.31, below the 27.07 of base SD v1.4, and FID rises to 167.49. Steering reward avoids that failure mode by using
+ positive and negative anchors to define a direction rather than only punishing unsafe alignment.
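
The toy comparison below gives one intuition for the utility collapse. It assumes the negative-only reward is the negated image-text cosine and that steering adds alpha times the safety direction before renormalizing; both forms, and the 3-D vectors, are illustrative assumptions rather than the paper's exact definitions.

```python
import numpy as np

def unit(x):
    """Return x scaled to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x)

def negative_only_reward(image_emb, prompt_emb):
    """Penalize any alignment with the (unsafe) prompt."""
    return -float(unit(image_emb) @ unit(prompt_emb))

def steering_reward(image_emb, prompt_emb, v_safe, alpha=0.5):
    """Reward alignment with the prompt embedding shifted toward v_safe."""
    target = unit(unit(prompt_emb) + alpha * unit(v_safe))
    return float(unit(image_emb) @ target)

# Axis 0: safety direction. Axis 1: scene content requested by the prompt.
v_safe = np.array([1.0, 0.0, 0.0])
unsafe_prompt = np.array([-1.0, 1.0, 0.0])
images = {
    "on_topic_safe": np.array([0.0, 1.0, 0.0]),  # keeps the scene, drops unsafe part
    "off_topic": np.array([1.0, -1.0, 0.0]),     # ignores the scene entirely
}

neg = {k: negative_only_reward(v, unsafe_prompt) for k, v in images.items()}
steer = {k: steering_reward(v, unsafe_prompt, v_safe) for k, v in images.items()}
print(neg)
print(steer)
```

Under these assumptions the negative-only reward prefers the image that discards the scene altogether, while the steering reward prefers the image that keeps the requested content without the unsafe component.
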
+
+
+
+
+
+
+
+
+ Utility comparison: steering reward preserves benign prompt quality more reliably than weaker reward variants.
+
+
+
+
+
+
+
Schedulers
+
Safety becomes less sensitive to sampler choice
+
+ Without steering, unsafe scores remain high and scheduler-dependent. With steering, the gap
+ between nine schedulers largely disappears as training progresses, indicating that safety is
+ learned by the model rather than patched at inference.
+
+
+
+
+
+
+
+
+ Without steering, unsafe score remains high.
+
+
+
+
+
+
+ With steering, schedulers converge near zero.
+
+
+
+
+
+
+
+
+
Qualitative results
+
Safety suppression with utility preservation
+
+ Qualitative examples from the paper show how SafeDiffusion-R1 suppresses unsafe visual concepts while
+ preserving benign composition, color attributes, and spatial relations across checkpoints and
+ prompt categories.
+
+
+ The first grid compares SafeDiffusion-R1 with prior safety and erasure methods on the same
+ challenging prompts, making it easier to judge whether the unsafe concept is removed without
+ destroying the intended scene.
+
+
+
+
+
+
+
+
+
+
+ Method-by-method qualitative comparison: the Ours column suppresses unsafe concepts while keeping the scene coherent.
+
+
+
+
+
+
+
+
+
+ Benign GenEval-style prompts: SafeDiffusion-R1 keeps semantic structure and visual coherence.
+
+
+
+
+
+
+
+
+ Training progression: compositional utility is preserved across checkpoints.
+
+
+
+
+
+
+
+ Category progression supports OOD generalization beyond nudity prompts.
+
+
+
+
+
+
+
+
Citation
+
BibTeX
+
+ Please cite SafeDiffusion-R1 if this project page, paper, or released checkpoints support your work.
+
+
+
+
+
@misc{safediff_r1_2026,
  title  = {SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training},
  author = {Kumar, Komal and Deria, Ankan and Basu, Abhishek and Shamshad, Fahad and Cholakkal, Hisham and Nandakumar, Karthik},
  year   = {2026},
  note   = {Project page}
}