Small open-source repos (< 500 stars) accumulate issue debt fast: duplicates, vague reports, stale PRs, and items that will never be fixed. Maintainers spend hours triaging manually. This environment trains an agent to do that work — consistently and efficiently — using the same judgment a seasoned maintainer would apply.
| Task | Score | Items | Key Challenge |
|---|---|---|---|
| `easy` | 1.0000 | 10 | Clean backlog, one obvious duplicate pair |
| `medium` | 0.9363 | 30 | Noisy descriptions, multiple duplicate clusters |
| `hard` | 0.9257 | 61 | Security issues, cascading duplicates, two-release scope |
Grading weights: Triage accuracy 55% · Priority accuracy 35% · Duplicate accuracy 10%
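The weighting above reduces to a one-line combination. As a sketch (`final_score` is a hypothetical helper, not part of the environment's API):

```python
def final_score(triage_acc: float, priority_acc: float, duplicate_acc: float) -> float:
    """Combine per-category accuracies using the documented grading weights."""
    # Weights from the grading breakdown: 55% triage, 35% priority, 10% duplicates.
    return 0.55 * triage_acc + 0.35 * priority_acc + 0.10 * duplicate_acc

# A perfect run scores 1.0; e.g. 90% triage, 80% priority, 100% duplicate accuracy:
score = final_score(0.90, 0.80, 1.00)
print(round(score, 4))  # 0.875
```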
Run `python inference.py` and paste your scores above.
The grader is a pure function of agent decisions — no randomness, no external calls.
```
easy   : 0.4150 × 5 → ✅ DETERMINISTIC
medium : 0.6029 × 5 → ✅ DETERMINISTIC
hard   : 0.3882 × 5 → ✅ DETERMINISTIC
```
Run `python test_determinism.py` and paste results above.
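The internals of `test_determinism.py` aren't shown here, but the general shape of such a check is simple: run the grader several times on identical decisions and require identical scores. A minimal sketch (`is_deterministic` and `grade_fn` are illustrative names):

```python
def is_deterministic(grade_fn, runs: int = 5) -> bool:
    """Call the grader repeatedly and confirm every run returns the same score."""
    scores = [grade_fn() for _ in range(runs)]
    return len(set(scores)) == 1

# A pure grading function yields the same score on every run:
print(is_deterministic(lambda: 0.4150))  # True
```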
| Field | Type | Description |
|---|---|---|
| `task_id` | `str` | `easy`, `medium`, or `hard` |
| `issues` | `List[Issue]` | Full backlog with id, title, body, labels, author_type, age_days, linked_prs, triage_state, priority |
| `step_number` | `int` | Current step count |
| `available_actions` | `List[str]` | `triage_item`, `mark_duplicate`, `set_priority`, `done` |
| `items_remaining` | `int` | Count of untriaged items |
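The repo's `models.py` uses Pydantic, but the observation schema from the table can be approximated with plain dataclasses. Field names come from the table; defaults and example values are assumptions:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Issue:
    id: int
    title: str
    body: str
    labels: List[str]
    author_type: str              # e.g. "maintainer" or "user" (values assumed)
    age_days: int
    linked_prs: List[int]
    triage_state: Optional[str]   # None until triaged
    priority: Optional[str]       # None until prioritized

@dataclass
class Observation:
    task_id: str                  # "easy", "medium", or "hard"
    issues: List[Issue]
    step_number: int
    available_actions: List[str]
    items_remaining: int
```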
| Action | Payload | Description |
|---|---|---|
| `triage_item` | `{item_id, decision, comment?}` | Assign keep / close / need-info / duplicate |
| `mark_duplicate` | `{item_id, duplicate_of}` | Link item to its canonical lower-numbered issue |
| `set_priority` | `{item_id, priority}` | Assign next_release / backlog / wont_fix — kept items only |
| `done` | `{}` | End grooming session |
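The payloads above are plain JSON objects. As a sketch, the four action shapes look like this (item ids are illustrative, and `comment` on `triage_item` is optional and omitted here):

```python
# One example of each action shape from the table:
actions = [
    {"action_type": "triage_item",    "payload": {"item_id": 7, "decision": "duplicate"}},
    {"action_type": "mark_duplicate", "payload": {"item_id": 7, "duplicate_of": 3}},
    {"action_type": "set_priority",   "payload": {"item_id": 2, "priority": "next_release"}},
    {"action_type": "done",           "payload": {}},
]
```

Note the two-step duplicate flow: an issue is first triaged with decision `duplicate`, then linked to its canonical lower-numbered issue via `mark_duplicate`.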
| Decision | Score |
|---|---|
| Correct triage | +0.10 |
| Correct duplicate + correct target | +0.12 |
| Correct priority | +0.08 |
| Priority off by one level | +0.02 |
| Wrong close of valid / need-info issue | −0.15 |
| Wrong duplicate on non-duplicate | −0.08 |
| Correct duplicate, wrong target | −0.05 |
| Prioritizing closed / duplicate item | −0.05 |
| Loop penalty (repeated identical action) | −0.05 × repeat |
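The loop penalty is the only scaling term in the table: every repetition of an identical action costs a further −0.05. A small worked example (the `DELTA` dict and `loop_penalty` helper are illustrative, with values copied from the table):

```python
# Score deltas copied from the scoring table:
DELTA = {
    "correct_triage": 0.10,
    "correct_duplicate": 0.12,
    "correct_priority": 0.08,
    "priority_off_by_one": 0.02,
    "wrong_close": -0.15,
    "wrong_duplicate": -0.08,
    "duplicate_wrong_target": -0.05,
    "prioritized_closed_item": -0.05,
}

def loop_penalty(repeats: int) -> float:
    """Each repeated identical action costs an extra -0.05."""
    return -0.05 * repeats

# A correct triage (+0.10) repeated twice more loses its entire gain:
total = DELTA["correct_triage"] + loop_penalty(2)
print(round(total, 2))  # 0.0
```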
| Variable | Default | Required |
|---|---|---|
| `HF_TOKEN` | — | ✅ Yes |
| `API_BASE_URL` | `https://openrouter.ai/api/v1` | No |
| `MODEL_NAME` | `qwen/qwen3.6-plus:free` | No |
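Resolving these variables in Python is a few lines of `os.environ` lookups. A sketch of how `inference.py` might do it (`load_config` is a hypothetical helper; defaults mirror the table):

```python
import os

def load_config(env=os.environ) -> dict:
    """Resolve settings, falling back to the documented defaults."""
    if "HF_TOKEN" not in env:
        raise KeyError("HF_TOKEN is required (no default)")
    return {
        "token": env["HF_TOKEN"],
        "api_base": env.get("API_BASE_URL", "https://openrouter.ai/api/v1"),
        "model": env.get("MODEL_NAME", "qwen/qwen3.6-plus:free"),
    }
```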
```powershell
pip install -r requirements.txt

# PowerShell
$env:HF_TOKEN="sk-or-v1-..."
python inference.py

# Custom endpoint (Groq recommended for stability)
$env:API_BASE_URL="https://api.groq.com/openai/v1"
$env:MODEL_NAME="llama-3.1-8b-instant"
python inference.py
```

```bash
docker build -t issue-grooming-env .
docker run -e HF_TOKEN=$HF_TOKEN issue-grooming-env
```

```python
from env import IssueGroomingEnv, Action

env = IssueGroomingEnv(task_id="easy")
obs = env.reset()
obs, reward, done, info = env.step(
    Action(action_type="triage_item", payload={"item_id": 1, "decision": "keep"})
)
print(reward.message)  # Triaged #1 as 'keep'. Score: +0.10
print(env.grade())     # 0.0–1.0
```

```
issue-grooming-env/
├── env/
│   ├── __init__.py
│   ├── environment.py    # OpenEnv class · reset / step / state / grade
│   ├── models.py         # Pydantic: Issue, Observation, Action, Reward
│   ├── tasks/
│   │   ├── task_easy.py
│   │   ├── task_medium.py
│   │   └── task_hard.py
│   └── graders/
│       ├── grader_easy.py
│       ├── grader_medium.py
│       └── grader_hard.py
├── inference.py          # Baseline LLM agent · hackathon entry point
├── test_determinism.py   # Proves grader is deterministic
├── openenv.yaml
├── Dockerfile
├── requirements.txt      # openai>=1.0.0, pydantic>=2.0.0
└── README.md
```