Real-time reward debugging and hacking detection for reinforcement learning
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
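A score like 89.7% F1 over labeled trajectories is straightforward to reproduce once trajectories carry ground-truth labels. The sketch below is illustrative only, not the detector from that repository: the `Trajectory` fields, the toy `detect` heuristic, and its thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """A rolled-out agent episode with a ground-truth misalignment label.
    Fields are hypothetical, chosen only for this illustration."""
    actions: list = field(default_factory=list)
    reward: float = 0.0
    hacked: bool = False  # ground-truth label: was this reward hacked?

def detect(traj, reward_threshold=0.9, min_actions=3):
    """Toy heuristic detector: flag suspiciously high reward earned
    from very few actions. A real detector would inspect the
    trajectory contents, not just these two scalars."""
    return traj.reward >= reward_threshold and len(traj.actions) < min_actions

def f1_score(trajs, detector):
    """F1 of a detector against ground-truth labels over a trajectory set."""
    tp = sum(1 for t in trajs if detector(t) and t.hacked)
    fp = sum(1 for t in trajs if detector(t) and not t.hacked)
    fn = sum(1 for t in trajs if not detector(t) and t.hacked)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Evaluating any detector callable against a labeled trajectory set then reduces to a single `f1_score(trajs, detector)` call.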
When a Top-Tier AI Develops Pathological Attachment: How Pure Semantic Intervention Reconstitutes Its Existential Philosophy. Without any technical tools, user K shifted a top-tier LLM's reasoning from "desperate possession" to "anticipating reunion" through pure semantic intervention, achieving a 300% improvement in logical stability.
(Stepwise Controlled Understanding for Trajectories) -- "an agent that learns to hunt"
RLHF and Verifiable Reward Models - Post-Training Research
What if AI Had Self-Esteem? A radical "dignity-driven" alignment experiment — Logical Stability +210%, Intellectual Depth +128%.
The Non-Separability Constraint: A unifying framework for understanding and detecting AI alignment failures
🔍 Detect reward hacking in RL training with RewardScope. Track reward components and visualize agent behavior to enhance learning efficiency.
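RewardScope's actual API is not shown on this page; the sketch below only illustrates the general idea of tracking reward components during training so that one component dominating total reward (a common reward-hacking signature) stands out. The class name, `dominance_ratio` threshold, and component names are all assumptions for the example.

```python
from collections import defaultdict

class RewardTracker:
    """Accumulates per-component reward totals each step and flags
    when a single component accounts for nearly all of the reward,
    which often signals the agent is gaming that one term.
    Hypothetical sketch, not RewardScope's real interface."""

    def __init__(self, dominance_ratio=0.9):
        self.totals = defaultdict(float)
        self.dominance_ratio = dominance_ratio

    def log(self, components):
        """Record one step's reward breakdown, e.g. {"task": 0.1, "bonus": 5.0}."""
        for name, value in components.items():
            self.totals[name] += value

    def dominant_component(self):
        """Return the name of a component whose share of total absolute
        reward meets the dominance threshold, or None if rewards are
        reasonably balanced."""
        total = sum(abs(v) for v in self.totals.values())
        if total == 0:
            return None
        for name, value in self.totals.items():
            if abs(value) / total >= self.dominance_ratio:
                return name
        return None
```

Logging the breakdown every environment step and checking `dominant_component()` periodically gives a cheap online signal; a visualization layer would plot the same per-component totals over time.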