dravidpa7/Issuse-grooming-OpenEnv

issue-grooming-env

An OpenEnv-compatible RL environment simulating open-source issue grooming.


Overview & Motivation

Small open-source repos (< 500 stars) accumulate issue debt fast: duplicates, vague reports, stale PRs, and items that will never be fixed. Maintainers spend hours triaging this backlog by hand. This environment trains an agent to do that work consistently and efficiently, applying the judgment a seasoned maintainer would.


Baseline Performance

| Task | Score | Items | Key Challenge |
|------|-------|-------|---------------|
| easy | 1.0000 | 10 | Clean backlog, one obvious duplicate pair |
| medium | 0.9363 | 30 | Noisy descriptions, multiple duplicate clusters |
| hard | 0.9257 | 61 | Security issues, cascading duplicates, two-release scope |

Grading weights: Triage accuracy 55% · Priority accuracy 35% · Duplicate accuracy 10%

To reproduce these scores, run python inference.py.
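
The grading weights above read as a weighted sum of three component accuracies. A minimal sketch of that arithmetic follows. The function name `combine_scores` and the assumption that each component is an accuracy in [0, 1] are illustrative, not taken from the repository's graders.

```python
# Weights come from the "Grading weights" line in this README.
# Everything else here (names, the [0, 1] range check) is an assumption.
WEIGHTS = {"triage": 0.55, "priority": 0.35, "duplicate": 0.10}


def combine_scores(triage_acc: float, priority_acc: float, duplicate_acc: float) -> float:
    """Return the weighted sum of three component accuracies in [0, 1]."""
    parts = {
        "triage": triage_acc,
        "priority": priority_acc,
        "duplicate": duplicate_acc,
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} accuracy must be in [0, 1], got {value}")
    # Round to four decimals to match the precision shown in the score table.
    return round(sum(WEIGHTS[name] * parts[name] for name in WEIGHTS), 4)
```

Under this reading, an agent with perfect triage, half-right priorities, and no correct duplicates would score 0.55 + 0.175 + 0.0.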


Determinism Verification

The grader is a pure function of agent decisions — no randomness, no external calls.

easy    : 0.4150 × 5 → ✅ DETERMINISTIC  
medium  : 0.6029 × 5 → ✅ DETERMINISTIC  
hard    : 0.3882 × 5 → ✅ DETERMINISTIC  

To reproduce these results, run python test_determinism.py.
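
The check above amounts to grading the same decisions several times and confirming that only one distinct score comes back. The sketch below shows that idea in isolation. `is_deterministic` and `toy_grader` are hypothetical stand-ins and do not reflect the contents of test_determinism.py.

```python
from typing import Callable, Dict, Sequence


def is_deterministic(
    grade_fn: Callable[[Sequence[Dict]], float],
    decisions: Sequence[Dict],
    runs: int = 5,
) -> bool:
    """Grade the same decisions `runs` times; True if every score is identical."""
    scores = {grade_fn(decisions) for _ in range(runs)}
    return len(scores) == 1


def toy_grader(decisions: Sequence[Dict]) -> float:
    """Hypothetical pure grader: fraction of decisions equal to 'keep'."""
    if not decisions:
        return 0.0
    kept = sum(1 for d in decisions if d["decision"] == "keep")
    return kept / len(decisions)
```

A grader that depends only on its input passes this check; one that reads a clock, a random source, or a remote service generally would not.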


Observation Space

| Field | Type | Description |
|-------|------|-------------|
| task_id | str | easy, medium, or hard |
| issues | List[Issue] | Full backlog with id, title, body, labels, author_type, age_days, linked_prs, triage_state, priority |
| step_number | int | Current step count |
| available_actions | List[str] | triage_item, mark_duplicate, set_priority, done |
| items_remaining | int | Count of untriaged items |
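
For orientation, the fields above could be modelled roughly as follows. This sketch uses standard-library dataclasses as a stand-in for the Pydantic models in env/models.py. The field names come from the table, but the types and defaults of the Issue fields, and the use of `None` to mean "untriaged", are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Issue:
    # Field names follow the table; types and defaults are assumed.
    id: int
    title: str
    body: str
    labels: List[str] = field(default_factory=list)
    author_type: str = "unknown"
    age_days: int = 0
    linked_prs: List[int] = field(default_factory=list)
    triage_state: Optional[str] = None  # assumption: None means untriaged
    priority: Optional[str] = None


@dataclass
class Observation:
    task_id: str
    issues: List[Issue]
    step_number: int
    available_actions: List[str]
    items_remaining: int


def count_untriaged(issues: List[Issue]) -> int:
    """Count issues with no triage decision yet (under the None assumption)."""
    return sum(1 for issue in issues if issue.triage_state is None)
```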

Action Space

| Action | Payload | Description |
|--------|---------|-------------|
| triage_item | {item_id, decision, comment?} | Assign keep / close / need-info / duplicate |
| mark_duplicate | {item_id, duplicate_of} | Link item to its canonical lower-numbered issue |
| set_priority | {item_id, priority} | Assign next_release / backlog / wont_fix (kept items only) |
| done | {} | End grooming session |
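
An agent can catch malformed actions before sending them by checking payloads against the table above. The helper below is a hypothetical client-side check, not part of the environment. It assumes item ids are integers and reads "canonical lower-numbered issue" as requiring `duplicate_of < item_id`.

```python
from typing import Any, Dict, List

# Allowed values and required payload keys, transcribed from the action table.
TRIAGE_DECISIONS = {"keep", "close", "need-info", "duplicate"}
PRIORITIES = {"next_release", "backlog", "wont_fix"}
REQUIRED_KEYS = {
    "triage_item": {"item_id", "decision"},  # "comment" is optional
    "mark_duplicate": {"item_id", "duplicate_of"},
    "set_priority": {"item_id", "priority"},
    "done": set(),
}


def validate_action(action_type: str, payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems with the action; an empty list means it looks valid."""
    if action_type not in REQUIRED_KEYS:
        return [f"unknown action_type: {action_type}"]
    missing = REQUIRED_KEYS[action_type] - set(payload)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    errors: List[str] = []
    if action_type == "triage_item" and payload["decision"] not in TRIAGE_DECISIONS:
        errors.append(f"invalid decision: {payload['decision']}")
    if action_type == "set_priority" and payload["priority"] not in PRIORITIES:
        errors.append(f"invalid priority: {payload['priority']}")
    # Assumption: the canonical issue always has the smaller id.
    if action_type == "mark_duplicate" and payload["duplicate_of"] >= payload["item_id"]:
        errors.append("duplicate_of should be a lower-numbered issue than item_id")
    return errors
```

This check cannot see environment state, so it does not enforce the "kept items only" rule for set_priority.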

Reward Signals

| Decision | Score |
|----------|-------|
| Correct triage | +0.10 |
| Correct duplicate + correct target | +0.12 |
| Correct priority | +0.08 |
| Priority off by one level | +0.02 |
| Wrong close of valid / need-info issue | −0.15 |
| Wrong duplicate on non-duplicate | −0.08 |
| Correct duplicate, wrong target | −0.05 |
| Prioritizing closed / duplicate item | −0.05 |
| Loop penalty (repeated identical action) | −0.05 × repeat |
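
Two rows in the table involve a small calculation: the partial credit for a priority that is one level off, and the loop penalty that scales with repeats. The sketch below shows one way to read them. The ordering of priority levels, the zero reward for a two-level miss, and the meaning of `repeat_count` are assumptions, since the table does not spell them out.

```python
# Assumed ordering of priority levels, from most to least urgent.
PRIORITY_ORDER = ["next_release", "backlog", "wont_fix"]

CORRECT_PRIORITY = 0.08      # "Correct priority" row
OFF_BY_ONE_PRIORITY = 0.02   # "Priority off by one level" row
LOOP_PENALTY_STEP = -0.05    # "Loop penalty" row, applied per repeat


def priority_reward(assigned: str, expected: str) -> float:
    """Reward for a set_priority action on a kept item."""
    distance = abs(PRIORITY_ORDER.index(assigned) - PRIORITY_ORDER.index(expected))
    if distance == 0:
        return CORRECT_PRIORITY
    if distance == 1:
        return OFF_BY_ONE_PRIORITY
    # Assumption: the table lists no value for a two-level miss, so use 0.0.
    return 0.0


def loop_penalty(repeat_count: int) -> float:
    """Penalty for repeating an identical action `repeat_count` times."""
    return LOOP_PENALTY_STEP * repeat_count
```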

Environment Variables

| Variable | Default | Required |
|----------|---------|----------|
| HF_TOKEN | | ✅ Yes |
| API_BASE_URL | https://openrouter.ai/api/v1 | No |
| MODEL_NAME | qwen/qwen3.6-plus:free | No |
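
A script consuming these variables might resolve them as sketched below, failing early when the required token is missing. `load_settings` is a hypothetical helper. How inference.py actually reads its configuration is not shown in this README.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the table above.
DEFAULT_API_BASE_URL = "https://openrouter.ai/api/v1"
DEFAULT_MODEL_NAME = "qwen/qwen3.6-plus:free"


def load_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Resolve HF_TOKEN, API_BASE_URL and MODEL_NAME, applying the documented defaults."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN is the only required variable and has no default.
        raise RuntimeError("HF_TOKEN is required but not set")
    return {
        "hf_token": token,
        "api_base_url": env.get("API_BASE_URL", DEFAULT_API_BASE_URL),
        "model_name": env.get("MODEL_NAME", DEFAULT_MODEL_NAME),
    }
```

Passing a plain dict instead of relying on `os.environ` keeps the helper easy to test.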

Setup & Usage

pip install -r requirements.txt
# PowerShell
$env:HF_TOKEN="sk-or-v1-..."
python inference.py

# Custom endpoint (Groq recommended for stability)
$env:API_BASE_URL="https://api.groq.com/openai/v1"
$env:MODEL_NAME="llama-3.1-8b-instant"
python inference.py

Docker

docker build -t issue-grooming-env .
docker run -e HF_TOKEN=$HF_TOKEN issue-grooming-env

Use as a library

from env import IssueGroomingEnv, Action

env = IssueGroomingEnv(task_id="easy")
obs = env.reset()

obs, reward, done, info = env.step(
    Action(action_type="triage_item", payload={"item_id": 1, "decision": "keep"})
)
print(reward.message)   # Triaged #1 as 'keep'. Score: +0.10
print(env.grade())      # 0.0–1.0

Repository Structure

issue-grooming-env/
├── env/
│   ├── __init__.py
│   ├── environment.py        # OpenEnv class · reset / step / state / grade
│   ├── models.py             # Pydantic: Issue, Observation, Action, Reward
│   ├── tasks/
│   │   ├── task_easy.py
│   │   ├── task_medium.py
│   │   └── task_hard.py
│   └── graders/
│       ├── grader_easy.py
│       ├── grader_medium.py
│       └── grader_hard.py
├── inference.py              # Baseline LLM agent · hackathon entry point
├── test_determinism.py       # Proves grader is deterministic
├── openenv.yaml
├── Dockerfile
├── requirements.txt          # openai>=1.0.0, pydantic>=2.0.0
└── README.md