---
title: Code Debugger Env
emoji: 🐞
colorFrom: red
colorTo: indigo
sdk: docker
pinned: false
app_port: 7860
---

# BugHunterRL: RL for Automated Code Debugging

**Submission for the Meta × PyTorch OpenEnv Hackathon @ Scaler**

13 real-world Python debugging tasks • Regression Test Oracle • Code Smell AST Penalty
Deployed on HF Spaces • FastAPI + Docker • OpenEnv Core 0.2.1



## 🌟 Why BugHunterRL?

BugHunterRL is a production-grade OpenEnv environment for training and evaluating RL agents on real-world Python debugging and security auditing. Agents must fix actual bugs, pass regression tests, and avoid introducing dangerous code patterns.

| Capability | Description |
|---|---|
| Regression Test Oracle | Every task has `failing_tests` (must fix) + `passing_tests` (must not break) |
| Code Smell AST Penalty | −40% score if the agent introduces `eval()`, bare `except`, hardcoded secrets, or infinite loops |
| Security Grader | Detects SQL injection, OS command injection, and weak hashing |
| Multi-File Simulation | Hard tasks simulate cross-module dependency bugs |
| Dynamic Randomization | 30% chance of a randomized task variant to prevent memorization |

## 🏗️ Environment Specifications

| Feature | Specification |
|---|---|
| API Type | RESTful, OpenEnv-compatible (FastAPI) |
| SDK | `openenv-core==0.2.1` |
| Task Count | 13 graded tasks |
| Difficulty Tiers | Easy (4), Medium (4), Hard (5) |
| Reward Range | Strictly (0.001, 0.999), Phase-2 validator compliant |
| Deployment | Docker-based Hugging Face Space |
| Max Episode Steps | 5 (all difficulties) |
| Inference Timeout | 1200 seconds |

## 🔌 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Root: status and endpoint directory |
| `/reset` | POST | Start a new episode; returns the first observation |
| `/step` | POST | Submit an action; returns reward + observation |
| `/state` | GET | Returns current episode state |
| `/health` | GET | Health check; returns `{"status": "healthy"}` |
| `/metadata` | GET | Environment metadata |
| `/stats` | GET | Live runtime statistics |
| `/schema` | GET | JSON schemas for actions, observations, and state |

### Verified Deployment

The BugHunterRL environment is publicly deployed and reachable on Hugging Face Spaces. The API has been manually verified on the live Space, so it is immediately ready for evaluation.


## 🎮 Action Space

Agents submit a `CodeDebugAction` to `/step`:

| Field | Type | Description |
|---|---|---|
| `bug_line` | int | 1-indexed line number of the bug |
| `bug_type` | str | `logic` / `runtime` / `security` / `mutable_state` / `syntax` |
| `fixed_code` | str | Complete corrected Python snippet |
| `explanation` | str | Technical explanation of the fix |

### Example `/step` Interaction

An illustrative example of how agents interact with the environment:

```json
{
  "action": {
    "bug_line": 2,
    "bug_type": "logic",
    "fixed_code": "def double_all(lst):\n    result = []\n    for i in range(len(lst)):\n        result.append(lst[i] * 2)\n    return result",
    "explanation": "Fixed the off-by-one bug by iterating across the full list instead of len(lst) - 1."
  }
}
```

Response:

```json
{
  "observation": {
    "task_id": "easy_01",
    "code_snippet": "def double_all(lst):\n    result = []\n    for i in range(len(lst) - 1):\n        result.append(lst[i] * 2)\n    return result",
    "task_description": "double_all should return a new list with every element doubled. The current implementation has an off-by-one error — it skips the last element.",
    "test_hint": "Tested with: ->, ->, []->[], result must be a list",
    "feedback": "All failing tests fixed. No regressions introduced.",
    "attempt_number": 1,
    "score_so_far": 0.999,
    "difficulty": "easy"
  },
  "reward": 0.999,
  "done": true
}
```
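
For programmatic use, the `/reset` and `/step` calls above can be driven from a short Python client. The sketch below is illustrative: it assumes the `requests` package for the (unexecuted) HTTP helper, and the helper names `make_step_payload` and `run_one_step` are not part of the environment's API.

```python
import json

ENV_BASE_URL = "http://localhost:7860"  # or the live HF Space URL

def make_step_payload(bug_line: int, bug_type: str,
                      fixed_code: str, explanation: str) -> dict:
    """Build the JSON body for POST /step carrying a CodeDebugAction."""
    return {
        "action": {
            "bug_line": bug_line,
            "bug_type": bug_type,
            "fixed_code": fixed_code,
            "explanation": explanation,
        }
    }

def run_one_step(session):
    """Reset the environment, then submit one fix (requires a live server)."""
    obs = session.post(f"{ENV_BASE_URL}/reset").json()["observation"]
    payload = make_step_payload(2, "logic", "def double_all(lst): ...",
                                "illustrative fix")
    return session.post(f"{ENV_BASE_URL}/step", json=payload).json()

# Build (but do not send) the payload from the example above:
payload = make_step_payload(2, "logic", "def double_all(lst): ...",
                            "fixed the off-by-one bug")
print(json.dumps(payload, indent=2))
```

Passing a `requests.Session` into `run_one_step` keeps the sketch free of a hard dependency while reusing one connection across `/reset` and `/step`.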

## 🔍 Observation Space

| Field | Type | Description |
|---|---|---|
| `code_snippet` | str | Buggy Python code to debug |
| `task_description` | str | Detailed requirements |
| `test_hint` | str | Test case information |
| `feedback` | str | Grader output from the previous attempt |
| `attempt_number` | int | Current attempt (1–5) |
| `score_so_far` | float | Best score this episode |
| `difficulty` | str | `easy` / `medium` / `hard` |
| `reward` | float | Delta reward (0.001–0.999) |
| `done` | bool | True when the episode ends |
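
For local tooling, the observation fields can be mirrored with a small dataclass. This is an illustrative sketch only; the authoritative schemas come from the `/schema` endpoint, and the class name here is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DebugObservation:
    """Client-side mirror of the per-step observation fields (illustrative)."""
    code_snippet: str
    task_description: str
    test_hint: str
    feedback: str
    attempt_number: int
    score_so_far: float
    difficulty: str

# Example instance shaped like the easy_01 observation above
obs = DebugObservation(
    code_snippet="def double_all(lst): ...",
    task_description="Double every element of the list.",
    test_hint="[] -> []",
    feedback="",
    attempt_number=1,
    score_so_far=0.001,
    difficulty="easy",
)
print(obs.difficulty, obs.attempt_number)
```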

## 📊 Grading System

### Layer 1: Regression Test Oracle

- `reward = (tests_fixed / total_failing) − (tests_broken / total_passing)`
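
A minimal sketch of this computation, including the strict (0.001, 0.999) clamp applied to all scores; the helper name is hypothetical and the environment's real grader may differ in detail.

```python
def oracle_reward(tests_fixed: int, total_failing: int,
                  tests_broken: int, total_passing: int) -> float:
    """Layer-1 reward: fraction of failing tests fixed minus fraction of
    passing tests regressed (illustrative sketch)."""
    raw = tests_fixed / total_failing - tests_broken / total_passing
    # All scores are strictly clamped into (0.001, 0.999)
    return min(max(raw, 0.001), 0.999)

print(oracle_reward(3, 3, 0, 5))  # perfect fix, no regressions -> 0.999
print(oracle_reward(1, 3, 2, 5))  # partial fix with regressions -> near floor
```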

### Layer 2: Code Smell Penalty (AST-based)

- Score × 0.6 (a 40% penalty) if the agent introduces `eval()`/`exec()`, bare `except:`, hardcoded credentials, or infinite `while True` loops
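
A check like this can be approximated with Python's `ast` module. The sketch below is simplified (hardcoded-credential detection is omitted) and is not the environment's actual grader:

```python
import ast

def has_code_smell(source: str) -> bool:
    """Detect a subset of the penalized smells via AST inspection."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # eval()/exec() called as plain names
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            return True
        # bare `except:` has no exception type
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            return True
        # literal `while True:` loop (break analysis omitted for brevity)
        if (isinstance(node, ast.While) and isinstance(node.test, ast.Constant)
                and node.test.value is True):
            return True
    return False

def apply_smell_penalty(score: float, source: str) -> float:
    """Multiply the score by 0.6 when a smell is present."""
    return score * 0.6 if has_code_smell(source) else score

print(apply_smell_penalty(0.9, "eval('1 + 1')"))  # penalized
print(apply_smell_penalty(0.9, "x = 1 + 1"))      # clean
```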

### Layer 3: Security Pattern Detection

- Hard security tasks verify removal of dangerous patterns and the presence of safe alternatives

All scores are strictly clamped between 0.001 and 0.999.


## Why this environment is hard for agents

BugHunterRL is designed as a meaningful RL benchmark that tests rigorous reasoning rather than simple pattern matching:

- **Regression Test Oracle:** agents must fix the specific failing tests without breaking existing passing behavior; rewards are highly sensitive to regressions.
- **Security-aware tasks:** hard tasks require removing deep-seated vulnerabilities such as SQL injection, weak hashes, and unsafe shell usage, rather than making superficial edits.
- **Code-smell penalty:** an AST-based penalty for `eval()`/`exec()`, bare `except:`, hardcoded secrets, and infinite loops discourages mechanical reward hacking.
- **Multi-step reasoning:** the harder bugs involve mutable default arguments or cross-module inconsistencies, which cannot be solved by single-line patches.
- **Randomized variants:** a portion of task variants are randomized to reduce memorization and force agents to generalize their debugging logic.
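
To make the security-aware point concrete, here is a self-contained sketch (using `sqlite3` and a hypothetical `users` table, not the task's actual code) of the SQL-injection bug and parameterized fix that a task like `hard_02` targets:

```python
import sqlite3

def find_user_unsafe(conn, name):
    # VULNERABLE: f-string interpolation lets input rewrite the query
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(conn, name):
    # FIXED: parameterized query; the driver handles quoting/escaping
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (name,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

# A classic injection payload dumps every row through the unsafe path:
print(find_user_unsafe(conn, "x' OR '1'='1"))  # [(1,)]
print(find_user_safe(conn, "x' OR '1'='1"))    # []
```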

## 🗂️ Task Catalog

### Easy (4 tasks)

| Task ID | Bug | Type |
|---|---|---|
| `easy_01` | Off-by-one in list doubler | logic |
| `easy_02` | IndexError in palindrome checker | runtime |
| `easy_03` | Missing assignment (`count + 1` vs `count += 1`) | logic |
| `easy_04` | Product initialized to 0 instead of 1 | logic |

### Medium (4 tasks)

| Task ID | Bug | Type |
|---|---|---|
| `medium_01` | Infinite recursion (`lst` not sliced) | runtime |
| `medium_02` | Float division in binary search | runtime |
| `medium_03` | Wrong return variable | logic |
| `medium_04` | Wrong return variable | logic |

### Hard (5 tasks)

| Task ID | Bug | Type |
|---|---|---|
| `hard_01` | Mutable default argument | mutable_state |
| `hard_02` | SQL injection via f-string | security |
| `hard_03` | Weak MD5 password hashing | security |
| `hard_04` | OS command injection via `shell=True` | security |
| `hard_05` | Cross-module typo (`superuser` vs `super_user`) | logic |
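
As an illustration of why `hard_01` resists superficial patches, the classic mutable-default-argument bug looks like this (function names here are hypothetical, not the task's actual code):

```python
def collect_buggy(item, bucket=[]):
    # BUG: the default list is created once at definition time and
    # shared across every call that relies on the default
    bucket.append(item)
    return bucket

def collect_fixed(item, bucket=None):
    # FIX: use a None sentinel and build a fresh list per call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

collect_buggy("a")
print(collect_buggy("b"))  # state leaks across calls: ['a', 'b']
print(collect_fixed("a"))  # fresh list each call: ['a']
print(collect_fixed("b"))  # fresh list each call: ['b']
```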

## 📈 Baseline Scores (Meta Llama 3.1 8B)

| Difficulty | Avg Score |
|---|---|
| Easy | 0.85 |
| Medium | 0.72 |
| Hard | 0.48 |
| **Overall** | **0.68** |

## 🚀 Quickstart

### Run Locally

```bash
git clone https://huggingface.co/spaces/raunit19/code-debugger-env
cd code-debugger-env
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:.
python server/app.py
```

### Verify

```bash
curl http://localhost:7860/health
# {"status": "healthy"}
```

## 🤖 Reproduce Baseline Evaluation

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="your_token_here"
export ENV_BASE_URL="http://localhost:7860"
python inference.py
```

Evaluator-facing logs are emitted in the standardized `[START]`, `[STEP]`, and `[END]` format for deterministic parsing.


## Reproduce in 60 seconds

Follow these steps to quickly verify the environment and the baseline evaluation.

1. Open the live Space: https://raunit19-code-debugger-env.hf.space/
2. Check the health endpoint: `/health` should return `{"status": "healthy"}`.
3. Use `/docs` to call `POST /reset` and inspect the initial observation.
4. Run the baseline evaluation script locally:

   ```bash
   export API_BASE_URL="https://router.huggingface.co/v1"
   export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
   export HF_TOKEN="your_huggingface_token"
   python inference.py
   ```

`inference.py` emits standardized `[START]`, `[STEP]`, and `[END]` logs to stdout for the OpenEnv evaluator.
