---
title: Code Debugger Env
emoji: 🐞
colorFrom: red
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
---
Submission for Meta × PyTorch OpenEnv Hackathon @ Scaler
13 real-world Python debugging tasks • Regression Test Oracle • Code Smell AST Penalty
Deployed on HF Spaces • FastAPI + Docker • OpenEnv Core 0.2.1
BugHunterRL is a production-grade OpenEnv environment for training and evaluating RL agents on real-world Python debugging and security auditing. Agents must fix actual bugs, pass regression tests, and avoid introducing dangerous code patterns.
| Capability | Description |
|---|---|
| Regression Test Oracle | Every task has failing_tests (must fix) + passing_tests (must not break) |
| Code Smell AST Penalty | -40% score if agent introduces eval(), bare except, hardcoded secrets, or infinite loops |
| Security Grader | Detects SQL injection, OS command injection, and weak hashing |
| Multi-File Simulation | Hard tasks simulate cross-module dependency bugs |
| Dynamic Randomization | 30% chance of randomized task variant to prevent memorization |
| Feature | Specification |
|---|---|
| API Type | RESTful OpenAI-compatible (FastAPI) |
| SDK | openenv-core==0.2.1 |
| Task Count | 13 Graded Tasks |
| Difficulty Tiers | Easy (4), Medium (4), Hard (5) |
| Reward Range | Strictly (0.001, 0.999) — Phase-2 validator compliant |
| Deployment | Docker-based Hugging Face Space |
| Max Episode Steps | 5 (all difficulties) |
| Inference Timeout | 1200 seconds |
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Root — provides status and endpoint directory |
| `/reset` | POST | Start new episode, returns first observation |
| `/step` | POST | Submit action, returns reward + observation |
| `/state` | GET | Returns current episode state |
| `/health` | GET | Health check — returns `{"status": "healthy"}` |
| `/metadata` | GET | Environment metadata |
| `/stats` | GET | Live runtime statistics |
| `/schema` | GET | Returns JSON schemas for actions, observations, and state |
The BugHunterRL environment is publicly deployed and reachable on Hugging Face Spaces. The API has been manually verified on the live Space to confirm all endpoints respond and are ready for evaluation.
- Root: https://raunit19-code-debugger-env.hf.space/
- Health: https://raunit19-code-debugger-env.hf.space/health
- Metadata: https://raunit19-code-debugger-env.hf.space/metadata
- Stats: https://raunit19-code-debugger-env.hf.space/stats
- Swagger Docs: https://raunit19-code-debugger-env.hf.space/docs
- OpenAPI JSON: https://raunit19-code-debugger-env.hf.space/openapi.json
Agents submit a `CodeDebugAction` to `/step`:

| Field | Type | Description |
|---|---|---|
| `bug_line` | int | 1-indexed line number of the bug |
| `bug_type` | str | logic / runtime / security / mutable_state / syntax |
| `fixed_code` | str | Complete corrected Python snippet |
| `explanation` | str | Technical explanation of the fix |
This is an illustrative example of how agents interact with the environment:
```json
{
  "action": {
    "bug_line": 2,
    "bug_type": "logic",
    "fixed_code": "def double_all(lst):\n    result = []\n    for i in range(len(lst)):\n        result.append(lst[i] * 2)\n    return result",
    "explanation": "Fixed the off-by-one bug by iterating across the full list instead of len(lst) - 1."
  }
}
```

Response:

```json
{
  "observation": {
    "task_id": "easy_01",
    "code_snippet": "def double_all(lst):\n    result = []\n    for i in range(len(lst) - 1):\n        result.append(lst[i] * 2)\n    return result",
    "task_description": "double_all should return a new list with every element doubled. The current implementation has an off-by-one error — it skips the last element.",
    "test_hint": "Tested with: ->, ->, []->[], result must be a list",
    "feedback": "All failing tests fixed. No regressions introduced.",
    "attempt_number": 1,
    "score_so_far": 0.999,
    "difficulty": "easy"
  },
  "reward": 0.999,
  "done": true
}
```

| Field | Type | Description |
|---|---|---|
| `code_snippet` | str | Buggy Python code to debug |
| `task_description` | str | Detailed requirements |
| `test_hint` | str | Test case information |
| `feedback` | str | Grader output from previous attempt |
| `attempt_number` | int | Current attempt (1–5) |
| `score_so_far` | float | Best score this episode |
| `difficulty` | str | easy / medium / hard |
| `reward` | float | Delta reward (0.001–0.999) |
| `done` | bool | True when episode ends |
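The request/response cycle above can be exercised with a small standard-library client. This is an illustrative sketch, not code from the repository; the helper names (`build_action`, `post`, `run_episode`) and the localhost URL are assumptions:

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local deployment from the Quick Start


def build_action(bug_line: int, bug_type: str, fixed_code: str, explanation: str) -> dict:
    """Package a CodeDebugAction in the shape /step expects."""
    return {
        "action": {
            "bug_line": bug_line,
            "bug_type": bug_type,
            "fixed_code": fixed_code,
            "explanation": explanation,
        }
    }


def post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode() -> dict:
    """One reset/step round trip (requires the server to be running)."""
    obs = post("/reset", {})["observation"]
    action = build_action(
        2,
        "logic",
        "def double_all(lst):\n    return [x * 2 for x in lst]",
        "Iterate over the whole list so the last element is doubled too.",
    )
    return post("/step", action)


# The action payload itself can be inspected without a live server:
payload = build_action(2, "logic", "def f(): pass", "example fix")
print(payload["action"]["bug_type"])  # -> logic
```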
- Reward = (tests_fixed / total_failing) − (tests_broken / total_passing)
- Score × 0.6 (a 40% penalty) if the agent introduces `eval()`/`exec()`, a bare `except:`, hardcoded credentials, or an infinite `while True` loop
- Hard security tasks verify removal of dangerous patterns and presence of safe alternatives

All scores are strictly clamped between 0.001 and 0.999.
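The scoring rules above can be condensed into a few lines. This is an illustrative reconstruction of the described behavior, not the environment's actual grader code:

```python
def compute_reward(tests_fixed: int, total_failing: int,
                   tests_broken: int, total_passing: int,
                   has_code_smell: bool) -> float:
    # Base reward: fraction of failing tests fixed, minus fraction of
    # passing tests broken (regressions are punished directly).
    score = tests_fixed / total_failing - tests_broken / total_passing
    if has_code_smell:
        score *= 0.6  # -40% AST code-smell penalty
    # Strict clamp for Phase-2 validator compliance.
    return min(0.999, max(0.001, score))


# A perfect fix with no regressions and no smells earns the ceiling:
print(compute_reward(3, 3, 0, 5, False))  # -> 0.999
# The same fix with eval() introduced: 1.0 * 0.6 = 0.6
print(compute_reward(3, 3, 0, 5, True))   # -> 0.6
```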
BugHunterRL is designed as a meaningful RL benchmark that tests rigorous reasoning rather than simple pattern matching:
- Regression Test Oracle: Agents must fix specific failing tests without breaking existing passing behavior; rewards are highly sensitive to regressions.
- Security-aware tasks: Hard tasks require removing deep-seated vulnerabilities like SQL injection, weak hashes, and unsafe shell usage rather than superficial edits.
- Code-smell penalty: An AST-based penalty for `eval()`/`exec()`, bare `except:`, hardcoded secrets, and infinite loops discourages mechanical reward hacking.
- Multi-step reasoning: Significant bugs involve mutable default arguments or cross-module inconsistencies, which cannot be solved by single-line patches.
- Randomized variants: A portion of task variants are randomized to reduce memorization and force agents to generalize their debugging logic.
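As a rough sketch of how such an AST-based check can work, Python's `ast` module makes three of the flagged patterns easy to detect. The environment's actual detector may differ (for example, in how it handles hardcoded secrets, which this sketch omits):

```python
import ast


def has_code_smell(source: str) -> bool:
    """Flag eval()/exec() calls, bare except clauses, and `while True` loops."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # eval()/exec() calls by name
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            return True
        # bare `except:` clause (no exception type)
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            return True
        # `while True:` (a coarse heuristic; it fires even if a break exists)
        if (isinstance(node, ast.While)
                and isinstance(node.test, ast.Constant)
                and node.test.value is True):
            return True
    return False


print(has_code_smell("x = eval(user_input)"))                        # True
print(has_code_smell("try:\n    pass\nexcept:\n    pass"))           # True
print(has_code_smell("def f(lst):\n    return [x * 2 for x in lst]"))  # False
```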
| Task ID | Bug | Type |
|---|---|---|
| easy_01 | Off-by-one in list doubler | logic |
| easy_02 | IndexError in palindrome checker | runtime |
| easy_03 | Missing assignment (count+1 vs count+=1) | logic |
| easy_04 | Product initialized to 0 instead of 1 | logic |
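For illustration, the easy_04 bug class looks like the following; this is a hypothetical reconstruction from the table above, not the exact snippet the environment serves:

```python
def product_buggy(nums):
    total = 0          # bug: multiplying into 0 always yields 0
    for n in nums:
        total *= n
    return total


def product_fixed(nums):
    total = 1          # fix: start from the multiplicative identity
    for n in nums:
        total *= n
    return total


print(product_buggy([2, 3, 4]))  # -> 0
print(product_fixed([2, 3, 4]))  # -> 24
```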
| Task ID | Bug | Type |
|---|---|---|
| medium_01 | Infinite recursion (lst not sliced) | runtime |
| medium_02 | Float division in binary search | runtime |
| medium_03 | Wrong return variable | logic |
| medium_04 | Wrong return variable | logic |
| Task ID | Bug | Type |
|---|---|---|
| hard_01 | Mutable default argument | mutable_state |
| hard_02 | SQL Injection via f-string | security |
| hard_03 | Weak MD5 password hashing | security |
| hard_04 | OS command injection via shell=True | security |
| hard_05 | Cross-module typo superuser vs super_user | logic |
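The security tasks reward structural fixes rather than superficial edits. Here is a hedged illustration of the hard_02 pattern using an in-memory SQLite database (the task's actual snippet may differ): an f-string query is injectable, while a parameterized query keeps attacker input out of the SQL grammar.

```python
import sqlite3


def get_user_unsafe(conn, username):
    # VULNERABLE: username is interpolated directly into the SQL text
    return conn.execute(
        f"SELECT * FROM users WHERE name = '{username}'"
    ).fetchall()


def get_user_safe(conn, username):
    # FIXED: placeholder binding treats the input as a literal value
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a1"), ("bob", "b2")])

attack = "' OR '1'='1"
print(len(get_user_unsafe(conn, attack)))  # -> 2 (injection dumps every row)
print(len(get_user_safe(conn, attack)))    # -> 0 (treated as a literal name)
```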
| Difficulty | Avg Score |
|---|---|
| Easy | 0.85 |
| Medium | 0.72 |
| Hard | 0.48 |
| Overall | 0.68 |
```bash
git clone https://huggingface.co/spaces/raunit19/code-debugger-env
cd code-debugger-env
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:.
python server/app.py
```

Verify the server is up:

```bash
curl http://localhost:7860/health
# {"status": "healthy"}
```

Run the baseline agent:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="your_token_here"
export ENV_BASE_URL="http://localhost:7860"
python inference.py
```

Evaluator-facing logs are emitted in the standardized [START], [STEP], and [END] format for deterministic parsing.
Follow these steps to quickly verify the environment and baseline evaluation.
- Open the live Space: https://raunit19-code-debugger-env.hf.space/
- Check the health endpoint: `/health` should return `{"status": "healthy"}`.
- Use `/docs` to call `POST /reset` and inspect the initial observation.
- Run the baseline evaluation script locally:
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="your_huggingface_token"
python inference.py
```

inference.py emits standardized [START], [STEP], and [END] logs to stdout for the OpenEnv evaluator.