---
title: Code Debugger Env
emoji: 🐞
colorFrom: red
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
---
Submission for Meta × PyTorch OpenEnv Hackathon @ Scaler
13 real-world Python debugging tasks • Regression Test Oracle • Code Smell AST Penalty
Deployed on HF Spaces • FastAPI + Docker • OpenEnv Core 0.2.1
BugHunterRL is a production-grade OpenEnv environment for training and evaluating RL agents on real-world Python debugging and security auditing. Agents must fix actual bugs, pass regression tests, and avoid introducing dangerous code patterns.
| Capability | Description |
|---|---|
| Regression Test Oracle | Every task has failing_tests (must fix) + passing_tests (must not break) |
| Code Smell AST Penalty | -40% score if agent introduces eval(), bare except, hardcoded secrets, or infinite loops |
| Security Grader | Detects SQL injection, OS command injection, and weak hashing |
| Multi-File Simulation | Hard tasks simulate cross-module dependency bugs |
| Dynamic Randomization | 30% chance of randomized task variant to prevent memorization |
| Feature | Specification |
|---|---|
| API Type | RESTful OpenAI-compatible (FastAPI) |
| SDK | openenv-core==0.2.1 |
| Task Count | 13 Graded Tasks |
| Difficulty Tiers | Easy (4), Medium (4), Hard (5) |
| Reward Range | Strictly (0.001, 0.999) — Phase-2 validator compliant |
| Deployment | Docker-based Hugging Face Space |
| Max Episode Steps | 5 (all difficulties) |
| Inference Timeout | 1200 seconds |
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Root — provides status and endpoint directory |
| `/reset` | POST | Start new episode, returns first observation |
| `/step` | POST | Submit action, returns reward + observation |
| `/state` | GET | Returns current episode state |
| `/health` | GET | Health check — returns `{"status": "healthy"}` |
| `/metadata` | GET | Environment metadata |
| `/stats` | GET | Live runtime statistics |
| `/schema` | GET | Returns JSON schemas for actions, observations, and state |
The BugHunterRL environment is publicly deployed and reachable on Hugging Face Spaces. The API has been manually verified on the live Space to confirm all endpoints respond and are ready for evaluation.
- Root: https://raunit19-code-debugger-env.hf.space/
- Health: https://raunit19-code-debugger-env.hf.space/health
- Metadata: https://raunit19-code-debugger-env.hf.space/metadata
- Stats: https://raunit19-code-debugger-env.hf.space/stats
- Swagger Docs: https://raunit19-code-debugger-env.hf.space/docs
- OpenAPI JSON: https://raunit19-code-debugger-env.hf.space/openapi.json
Agents submit a `CodeDebugAction` to `/step`:

| Field | Type | Description |
|---|---|---|
| `bug_line` | int | 1-indexed line number of the bug |
| `bug_type` | str | logic / runtime / security / mutable_state / syntax |
| `fixed_code` | str | Complete corrected Python snippet |
| `explanation` | str | Technical explanation of the fix |
This is an illustrative example of how agents interact with the environment:
```json
{
  "action": {
    "bug_line": 2,
    "bug_type": "logic",
    "fixed_code": "def double_all(lst):\n    result = []\n    for i in range(len(lst)):\n        result.append(lst[i] * 2)\n    return result",
    "explanation": "Fixed the off-by-one bug by iterating across the full list instead of len(lst) - 1."
  }
}
```

Response:

```json
{
  "observation": {
    "task_id": "easy_01",
    "code_snippet": "def double_all(lst):\n    result = []\n    for i in range(len(lst) - 1):\n        result.append(lst[i] * 2)\n    return result",
    "task_description": "double_all should return a new list with every element doubled. The current implementation has an off-by-one error — it skips the last element.",
    "test_hint": "Tested with: ->, ->, []->[], result must be a list",
    "feedback": "All failing tests fixed. No regressions introduced.",
    "attempt_number": 1,
    "score_so_far": 0.999,
    "difficulty": "easy"
  },
  "reward": 0.999,
  "done": true
}
```

| Field | Type | Description |
|---|---|---|
| `code_snippet` | str | Buggy Python code to debug |
| `task_description` | str | Detailed requirements |
| `test_hint` | str | Test case information |
| `feedback` | str | Grader output from previous attempt |
| `attempt_number` | int | Current attempt (1–5) |
| `score_so_far` | float | Best score this episode |
| `difficulty` | str | easy / medium / hard |
| `reward` | float | Delta reward (0.001–0.999) |
| `done` | bool | True when episode ends |
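The request/response cycle above can be exercised with a small standard-library client. This is an illustrative sketch, not code from the repository; the helper names (`build_action`, `post`, `run_episode`) and the localhost URL are assumptions:

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local deployment from the Quick Start


def build_action(bug_line: int, bug_type: str, fixed_code: str, explanation: str) -> dict:
    """Package a CodeDebugAction in the shape /step expects."""
    return {
        "action": {
            "bug_line": bug_line,
            "bug_type": bug_type,
            "fixed_code": fixed_code,
            "explanation": explanation,
        }
    }


def post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode() -> dict:
    """One reset/step round trip (requires the server to be running)."""
    obs = post("/reset", {})["observation"]
    action = build_action(
        2,
        "logic",
        "def double_all(lst):\n    return [x * 2 for x in lst]",
        "Iterate over the whole list so the last element is doubled too.",
    )
    return post("/step", action)


# The action payload itself can be inspected without a live server:
payload = build_action(2, "logic", "def f(): pass", "example fix")
print(payload["action"]["bug_type"])  # -> logic
```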
- Reward = (tests_fixed / total_failing) − (tests_broken / total_passing)
- Score × 0.6 (a 40% penalty) if the agent introduces `eval()`/`exec()`, a bare `except:`, hardcoded credentials, or an infinite `while True` loop
- Hard security tasks verify removal of dangerous patterns and presence of safe alternatives

All scores are strictly clamped between 0.001 and 0.999.
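The scoring rules above can be condensed into a few lines. This is an illustrative reconstruction of the described behavior, not the environment's actual grader code:

```python
def compute_reward(tests_fixed: int, total_failing: int,
                   tests_broken: int, total_passing: int,
                   has_code_smell: bool) -> float:
    # Base reward: fraction of failing tests fixed, minus fraction of
    # passing tests broken (regressions are punished directly).
    score = tests_fixed / total_failing - tests_broken / total_passing
    if has_code_smell:
        score *= 0.6  # -40% AST code-smell penalty
    # Strict clamp for Phase-2 validator compliance.
    return min(0.999, max(0.001, score))


# A perfect fix with no regressions and no smells earns the ceiling:
print(compute_reward(3, 3, 0, 5, False))  # -> 0.999
# The same fix with eval() introduced: 1.0 * 0.6 = 0.6
print(compute_reward(3, 3, 0, 5, True))   # -> 0.6
```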
BugHunterRL is designed as a meaningful RL benchmark that tests rigorous reasoning rather than simple pattern matching:
- Regression Test Oracle: Agents must fix specific failing tests without breaking existing passing behavior; rewards are highly sensitive to regressions.
- Security-aware tasks: Hard tasks require removing deep-seated vulnerabilities like SQL injection, weak hashes, and unsafe shell usage rather than superficial edits.
- Code-smell penalty: An AST-based penalty for `eval()`/`exec()`, bare `except:`, hardcoded secrets, and infinite loops discourages mechanical reward hacking.
- Multi-step reasoning: Significant bugs involve mutable default arguments or cross-module inconsistencies, which cannot be solved by single-line patches.
- Randomized variants: A portion of task variants are randomized to reduce memorization and force agents to generalize their debugging logic.
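As a rough sketch of how such an AST-based check can work, Python's `ast` module makes three of the flagged patterns easy to detect. The environment's actual detector may differ (for example, in how it handles hardcoded secrets, which this sketch omits):

```python
import ast


def has_code_smell(source: str) -> bool:
    """Flag eval()/exec() calls, bare except clauses, and `while True` loops."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # eval()/exec() calls by name
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            return True
        # bare `except:` clause (no exception type)
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            return True
        # `while True:` (a coarse heuristic; it fires even if a break exists)
        if (isinstance(node, ast.While)
                and isinstance(node.test, ast.Constant)
                and node.test.value is True):
            return True
    return False


print(has_code_smell("x = eval(user_input)"))                        # True
print(has_code_smell("try:\n    pass\nexcept:\n    pass"))           # True
print(has_code_smell("def f(lst):\n    return [x * 2 for x in lst]"))  # False
```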
| Task ID | Bug | Type |
|---|---|---|
| easy_01 | Off-by-one in list doubler | logic |
| easy_02 | IndexError in palindrome checker | runtime |
| easy_03 | Missing assignment (count+1 vs count+=1) | logic |
| easy_04 | Product initialized to 0 instead of 1 | logic |
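For illustration, the easy_04 bug class looks like the following; this is a hypothetical reconstruction from the table above, not the exact snippet the environment serves:

```python
def product_buggy(nums):
    total = 0          # bug: multiplying into 0 always yields 0
    for n in nums:
        total *= n
    return total


def product_fixed(nums):
    total = 1          # fix: start from the multiplicative identity
    for n in nums:
        total *= n
    return total


print(product_buggy([2, 3, 4]))  # -> 0
print(product_fixed([2, 3, 4]))  # -> 24
```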
| Task ID | Bug | Type |
|---|---|---|
| medium_01 | Infinite recursion (lst not sliced) | runtime |
| medium_02 | Float division in binary search | runtime |
| medium_03 | Wrong return variable | logic |
| medium_04 | Wrong return variable | logic |
| Task ID | Bug | Type |
|---|---|---|
| hard_01 | Mutable default argument | mutable_state |
| hard_02 | SQL Injection via f-string | security |
| hard_03 | Weak MD5 password hashing | security |
| hard_04 | OS command injection via shell=True | security |
| hard_05 | Cross-module typo superuser vs super_user | logic |
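The security tasks reward structural fixes rather than superficial edits. Here is a hedged illustration of the hard_02 pattern using an in-memory SQLite database (the task's actual snippet may differ): an f-string query is injectable, while a parameterized query keeps attacker input out of the SQL grammar.

```python
import sqlite3


def get_user_unsafe(conn, username):
    # VULNERABLE: username is interpolated directly into the SQL text
    return conn.execute(
        f"SELECT * FROM users WHERE name = '{username}'"
    ).fetchall()


def get_user_safe(conn, username):
    # FIXED: placeholder binding treats the input as a literal value
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a1"), ("bob", "b2")])

attack = "' OR '1'='1"
print(len(get_user_unsafe(conn, attack)))  # -> 2 (injection dumps every row)
print(len(get_user_safe(conn, attack)))    # -> 0 (treated as a literal name)
```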
| Difficulty | Avg Score |
|---|---|
| Easy | 0.85 |
| Medium | 0.72 |
| Hard | 0.48 |
| Overall | 0.68 |
```bash
git clone https://huggingface.co/spaces/raunit19/code-debugger-env
cd code-debugger-env
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:.
python server/app.py
```

Verify the server is up:

```bash
curl http://localhost:7860/health
# {"status": "healthy"}
```

Run the baseline agent:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="your_token_here"
export ENV_BASE_URL="http://localhost:7860"
python inference.py
```

Evaluator-facing logs are emitted in the standardized [START], [STEP], and [END] format for deterministic parsing.
Follow these steps to quickly verify the environment and baseline evaluation.
- Open the live Space: https://raunit19-code-debugger-env.hf.space/
- Check the health endpoint: `/health` should return `{"status": "healthy"}`.
- Use `/docs` to call `POST /reset` and inspect the initial observation.
- Run the baseline evaluation script locally:
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="your_huggingface_token"
python inference.py
```

inference.py emits standardized [START], [STEP], and [END] logs to stdout for the OpenEnv evaluator.