2 changes: 1 addition & 1 deletion docs/cli/lol.md
@@ -108,7 +108,7 @@ Report Claude Code token usage statistics.
lol usage [--today | --week]
```

Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket.
Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket. Assistant entries with the same `message.id` within a session are deduplicated to avoid double-counting streamed content blocks. Cost estimates use cache-tier pricing when cache token fields are present.

#### Options

175 changes: 175 additions & 0 deletions python/agentize/eval/eval-report-2026-03-16-swebench-20.md
@@ -0,0 +1,175 @@
# SWE-bench 20-Task Evaluation Report

**Date:** 2026-03-16 (costs corrected 2026-03-24)
**Benchmark:** SWE-bench Verified (20 astropy tasks)
**Modes tested:** cc.r (raw), cc.nl (nlcmd), cc.script (full with Claude consensus)
**Scoring:** SWE-bench Docker evaluator via podman

## Executive Summary

We scaled the SWE-bench evaluation from 5 to 20 tasks across three orchestration modes. The hypothesized ordering **cc.r < cc.nl < cc.script** is **confirmed, though by thin margins**: script orchestration (cc.script) resolves the most tasks at 60%, followed by NL orchestration (cc.nl) at 55%, and raw (cc.r) at 50%. Planning provides a consistent +5-10pp improvement, but cc.script costs **~15x more** than cc.r for that 10pp gain.

## Results

### Resolve Rates

| Mode | Name | Resolved | Rate | Cost | Time |
|------|-------------|----------|------|------|------|
| `--mode raw` | **cc.r** | 10/20 | 50% | $4.64 | 34 min |
| `--mode nlcmd` | **cc.nl** | 11/20 | 55% | ~$8.94* | ~52 min* |
| `--mode full` | **cc.script** | 12/20 | 60% | $70.62 | 3.9 hrs |

*\*cc.nl cost/time estimated from 6 measured tasks (resume overwrite lost first 14 tasks' metrics).*

> **Cost correction note (2026-03-24):** Original costs ($1.47 cc.r, $195.30 cc.script) were inaccurate due to two bugs: (1) raw mode ignored cache-tier pricing, undercounting by ~3x; (2) JSONL-based modes double-counted streaming content blocks, overcounting by ~1.7x. See [#982](https://github.com/Synthesys-Lab/agentize/issues/982). Re-run with fixed cost tracking on 2026-03-24.
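The corrected pricing scheme can be sketched directly from the two bug fixes: non-cache input at the base rate, cache reads and writes at their own tiers. The rates below are illustrative placeholders, not the harness's actual pricing table:

```python
# Sketch of cache-tier-aware cost computation, mirroring the fixed harness logic.
# RATES holds illustrative USD-per-1M-token prices, not the real pricing table.
RATES = {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75}

def cost_usd(input_tokens, output_tokens, cache_read=0, cache_write=0, rates=RATES):
    """Charge non-cache input at the base rate; cache reads/writes at their tiers."""
    non_cache = max(0, input_tokens - cache_read - cache_write)
    return (
        non_cache * rates["input"]
        + output_tokens * rates["output"]
        + cache_read * rates["cache_read"]
        + cache_write * rates["cache_write"]
    ) / 1_000_000

# A prompt dominated by cache reads costs far less than flat input pricing implies:
flat = (100_000 * RATES["input"] + 2_000 * RATES["output"]) / 1_000_000   # 0.33
tiered = cost_usd(100_000, 2_000, cache_read=90_000)                      # 0.087
```

Ignoring the cache tiers in either direction produces exactly the kind of multi-x error described in the note above.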

### Per-Instance Results

| Instance | cc.r | cc.nl | cc.script |
|----------|------|-------|-----------|
| astropy-12907 | PASS | PASS | PASS |
| astropy-13033 | FAIL | FAIL | FAIL |
| astropy-13236 | FAIL | FAIL | FAIL |
| astropy-13398 | FAIL | FAIL | FAIL |
| astropy-13453 | PASS | PASS | PASS |
| astropy-13579 | PASS | PASS | PASS |
| astropy-13977 | FAIL | FAIL | FAIL |
| astropy-14096 | PASS | PASS | PASS |
| astropy-14182 | FAIL | FAIL | FAIL |
| astropy-14309 | PASS | PASS | PASS |
| **astropy-14365** | FAIL | FAIL | **PASS** |
| astropy-14369 | FAIL | FAIL | FAIL |
| astropy-14508 | PASS | PASS | PASS |
| astropy-14539 | PASS | PASS | PASS |
| astropy-14598 | FAIL | FAIL | FAIL |
| astropy-14995 | PASS | PASS | PASS |
| **astropy-7166** | FAIL | **PASS** | **PASS** |
| astropy-7336 | PASS | PASS | PASS |
| astropy-7606 | FAIL | FAIL | FAIL |
| astropy-7671 | PASS | PASS | PASS |

### Token Usage & Timing

| Mode | Tokens total | Tokens mean | Time total | Time mean |
|------|-------------|-------------|------------|-----------|
| **cc.r** | 102,658 | 5,133 | 2,034s (34 min) | 102s |
| **cc.nl** | ~123,763* | ~6,188* | ~3,134s* (~52 min) | ~157s* |
| **cc.script** | 152,796 | 7,640 | 14,241s (3.9 hrs) | 712s |

## Analysis

### Finding 1: cc.r < cc.nl < cc.script confirmed (barely)

The ordering holds but the margins are thin:

| Comparison | Resolve delta | Cost delta |
|---|---|---|
| cc.nl vs cc.r | +1 task (+5pp) | ~2x more ($8.94 vs $4.64) |
| cc.script vs cc.nl | +1 task (+5pp) | ~8x more ($70.62 vs $8.94) |
| cc.script vs cc.r | +2 tasks (+10pp) | **~15x more** ($70.62 vs $4.64) |

Each step up in orchestration complexity solves exactly one additional task, and the cost scaling is superlinear: diminishing returns at increasing cost. Still, the corrected ~15x ratio is far more reasonable than the ~130x implied by the original, uncorrected figures ($195.30 vs $1.47).

### Finding 2: A core of 10 tasks is "easy" — all modes solve them

All three modes agree on 10 tasks (50%): astropy-12907, 13453, 13579, 14096, 14309, 14508, 14539, 14995, 7336, 7671. These are likely straightforward bugs where even raw `claude -p` can produce the right fix.

### Finding 3: A core of 8 tasks is "hard" — no mode solves them

All three modes fail on 8 tasks (40%): astropy-13033, 13236, 13398, 13977, 14182, 14369, 14598, 7606. These are beyond what current orchestration can address, regardless of planning sophistication.

### Finding 4: The interesting zone is 2 tasks (10%)

Only 2 tasks differentiate the modes:

| Task | cc.r | cc.nl | cc.script | What helps |
|------|------|-------|-----------|------------|
| **astropy-7166** | FAIL | PASS | PASS | Any planning helps |
| **astropy-14365** | FAIL | FAIL | PASS | Only script planning helps |

- **astropy-7166**: Both planning modes solve it, raw doesn't. Planning provides enough context to find the right approach.
- **astropy-14365**: Only the 5-agent debate (cc.script) solves it. The NL planning in cc.nl is insufficient — the structured critique/reducer/consensus pipeline catches something the single-shot NL planner misses.

### Finding 5: cc.nl is the best cost-performance tradeoff

| Mode | $/resolved task | Marginal cost per extra task |
|------|----------------|------------------------------|
| **cc.r** | $0.46 | — |
| **cc.nl** | $0.81 | ~$4.30 for +1 task |
| **cc.script** | $5.89 | ~$61.68 for +1 task over cc.nl |

cc.nl gets 55% resolve rate at ~$9 total — 8x cheaper than cc.script for only 1 fewer resolved task. The marginal cost of cc.script's extra task (astropy-14365) is ~$62.
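The cost-efficiency figures in this table follow directly from the headline costs and resolve counts:

```python
# Reproduce Finding 5's figures from the results table: (total cost USD, tasks resolved).
results = {"cc.r": (4.64, 10), "cc.nl": (8.94, 11), "cc.script": (70.62, 12)}

# Dollars per resolved task for each mode.
per_resolved = {mode: cost / solved for mode, (cost, solved) in results.items()}

# Each mode resolves exactly one more task than the previous, so the marginal
# cost of that extra task is just the cost delta between adjacent modes.
marginal_nl = results["cc.nl"][0] - results["cc.r"][0]          # ~$4.30 for +1 task
marginal_script = results["cc.script"][0] - results["cc.nl"][0]  # ~$61.68 for +1 task
```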

### Finding 6: Scaling from 5 to 20 tasks changed the picture

| Mode | 5-task rate | 20-task rate | Delta |
|------|-----------|-------------|-------|
| cc.r | 20% (1/5) | 50% (10/20) | +30pp |
| cc.script | 40% (2/5) | 60% (12/20) | +20pp |

The 5-task sample was pessimistic — the first 5 astropy tasks happened to be harder than average. At 20 tasks, raw mode's 50% baseline is much stronger, and the relative advantage of planning shrinks from +20pp to +10pp.

## Limitations

1. **Single repository** — All 20 tasks are from astropy. Results may not generalize to other Python projects or languages.
2. **cc.nl metrics estimated** — Resume bug overwrote first 14 tasks' cost/time data. Cost extrapolated from 6 measured tasks.
3. **codex.script not completed** — Codex API rate limits prevented completing the 20-task run (only 3/20 finished). Cannot compare cc.script vs codex.script at this scale.
4. **Single run** — No repeated trials to measure variance. The cc.r and cc.script re-runs (2026-03-24) are on newer model versions and may differ in resolve rates.
5. **No impl mode** — `--mode impl` was not run at 20-task scale for this report.
6. **cc.nl not re-run** — cc.nl costs were not re-measured with the corrected cost tracking; the ~$8.94 figure remains an estimate from the original run.

## Recommendations

1. **Use cc.nl as default** — best cost-performance ratio (55% at ~$9 vs 60% at $71).
2. **Use cc.script for high-stakes tasks** — when correctness matters more than cost, the extra ~$62/task buys +5pp.
3. **Complete codex.script evaluation** — needed to test the hypothesis cc.script ≈ codex.script.
4. **Investigate the 8 hard-fail tasks** — root-cause analysis may reveal systematic pipeline gaps.
5. **Expand to other repositories** — astropy-only evaluation limits generalizability.

## Appendix A: Cost Efficiency

| Mode | Cost/task | Time/task | $/second | Tokens/task |
|------|-----------|-----------|----------|-------------|
| **cc.r** | $0.23 | 102s | $0.0023 | 5,133 |
| **cc.nl** | ~$0.45 | ~157s | ~$0.003 | ~6,188 |
| **cc.script** | $3.53 | 712s | $0.005 | 7,640 |

## Appendix B: Model-Level Cost Breakdown (2026-03-25)

### Raw Opus Baseline

Running raw mode with opus instead of sonnet validates cost proportionality:

| Mode | Model | Tokens | Cost | Cost/task | Time |
|------|-------|--------|------|-----------|------|
| `--mode raw` | sonnet | 102,658 | $4.64 | $0.23 | 34 min |
| `--mode raw` | opus | 66,139 | $27.32 | $1.37 | 49 min |
| `--mode full` | opus plan + sonnet impl | 152,796 | $70.62 | $3.53 | 3.9 hrs |

**Key insight:** Raw opus costs $27.32 — the 5.9x ratio vs raw sonnet ($4.64) matches the ~5x opus/sonnet pricing difference. Full mode's $70.62 breaks down as: raw opus baseline ($27) + planning pipeline overhead (~$43). The planning overhead is ~1.6x the raw opus cost, not the 15x it appears when comparing against raw sonnet.
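A quick arithmetic check of this decomposition, using only the table's numbers:

```python
# Verify the opus/sonnet ratio and the planning-overhead breakdown above.
raw_sonnet, raw_opus, full_mode = 4.64, 27.32, 70.62

opus_ratio = raw_opus / raw_sonnet             # ~5.9x, consistent with ~5x pricing gap
planning_overhead = full_mode - raw_opus       # ~$43.30 of pipeline overhead
overhead_ratio = planning_overhead / raw_opus  # ~1.6x the raw opus cost
```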

### Codex Implementation Backend — Iteration Loop Bug

Running full mode with `--planner-backend claude:opus --impl-backend codex:gpt-5.2-codex` revealed a critical iteration bug. On task 1 (astropy-12907), the Codex implementation agent failed to produce the FSM completion marker, causing the harness to loop:

```
Iteration 1: impl-iter-1 (codex:gpt-5.2-codex) runs 83s → "completion marker missing"
Iteration 2: impl-iter-2 (codex:gpt-5.2-codex) runs 158s → "completion marker missing"
Iteration 3: impl-iter-3 (codex:gpt-5.2-codex) runs 81s → "completion marker missing"
Iteration 4: impl-iter-4 (codex:gpt-5.2-codex) runs 193s → "completion marker missing"
Iteration 5: (killed by user)
```

The Codex backend produced the correct patch in iteration 1 but never emitted the completion signal. Each subsequent iteration found "no changes to commit" yet still burned ~100-200s. This multi-iteration loop is the likely cause of unexpectedly long run times when using Codex as the implementation backend. The sonnet implementation backend consistently completes in exactly 1 iteration across all 20 tasks.
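One mitigation worth considering is a harness-side guard that accepts an iteration as complete when a patch already exists and the working tree stops changing. A minimal sketch, where `run_impl_iteration` and the marker string are hypothetical names, not the real harness API:

```python
# Hypothetical guard against the marker-never-emitted loop described above.
# `run_impl_iteration` and MARKER are illustrative names, not the real FSM API.
MARKER = "IMPL_COMPLETE"
MAX_ITERS = 4

def run_with_guard(run_impl_iteration):
    """Iterate the impl agent, but stop early if it stalls with no new changes."""
    for i in range(1, MAX_ITERS + 1):
        # Each iteration returns (agent output text, whether the tree changed).
        output, changed = run_impl_iteration(i)
        if MARKER in output:
            return "completed", i
        if not changed and i > 1:
            # A patch was produced earlier but the agent can't emit the marker:
            # accept the existing patch instead of burning more iterations.
            return "assumed-complete", i
    return "gave-up", MAX_ITERS
```

Under this policy the astropy-12907 run above would have stopped at iteration 2 instead of looping until killed.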

### nlcmd Mode — Planning Timeout Analysis (2026-03-25)

Re-running nlcmd with corrected cost tracking revealed that 7/20 tasks (35%) consistently time out during Phase 1 (NL planning via `claude -p "/ultra-planner"`), even across two attempts with the 900s planning budget:

| Attempt | Completed | Timed out | Cost |
|---------|-----------|-----------|------|
| Run 1 (20 tasks) | 10 | 10 | $101.22 |
| Run 2 (10 retried) | 3 | 7 | $57.85 |
| **Combined** | **13** | **7** | **$159.07** |

The nlcmd mode is paradoxically the most expensive ($159 total) due to timeout waste — the timed-out planning phases still consume tokens and cost without producing patches. The 7 persistently-failing tasks (13236, 13398, 13977, 14369, 14508, 14598, 7671) suggest the `claude -p` NL dispatch adds overhead that pushes the 5-agent debate past the 900s budget.
7 changes: 6 additions & 1 deletion python/agentize/eval/eval_harness.md
@@ -33,7 +33,7 @@ The harness supports four execution modes via `--mode`:

| Mode | What runs | What it tests | Cost tracking |
|------|-----------|---------------|---------------|
| `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage |
| `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage (cache-tier aware when provided) |
| `impl` | FSM orchestrator only (no planning) | The impl kernel loop | JSONL session files |
| `full` | Planning pipeline + FSM orchestrator | The agentize framework | JSONL session files |
| `nlcmd` | NL planning via `claude -p` + FSM | NL orchestration | JSONL session files |
@@ -154,6 +154,11 @@ Run costs scale with task count and per-task timeout. Validate incrementally:

## Cost Estimation

For JSONL-based modes (`impl`, `full`, `nlcmd`), cost is tracked via session file
diffing with per-session deduplication by `message.id` to avoid counting streamed
content blocks multiple times. Raw mode uses `claude -p` JSON output with
cache-tier-aware pricing when cache token fields are present.

Per-task costs depend on the model and task complexity. Rough estimates:

| Model | Tokens/task (est.) | Cost/task (est.) | 300 tasks |
47 changes: 39 additions & 8 deletions python/agentize/eval/eval_harness.py
@@ -61,43 +61,65 @@ def _make_result(instance_id: str) -> dict:
}


def _compute_cost(input_tokens: int, output_tokens: int, model: str) -> float:
def _compute_cost(
input_tokens: int,
output_tokens: int,
model: str,
cache_read: int = 0,
cache_write: int = 0,
) -> float:
"""Compute estimated USD cost from token counts and model short-name.

Falls back to 0.0 if the model is not in the pricing table.
When cache_read/cache_write are provided, applies cache-tier pricing:
non-cache input at base rate, cache reads at cache_read rate, cache
writes at cache_write rate. Falls back to 0.0 if model is unknown.
"""
from agentize.usage import match_model_pricing

model_id = _MODEL_ID_MAP.get(model, model)
rates = match_model_pricing(model_id)
if not rates:
return 0.0
non_cache = max(0, input_tokens - cache_read - cache_write)
return (
input_tokens * rates["input"] / 1_000_000
non_cache * rates["input"] / 1_000_000
+ output_tokens * rates["output"] / 1_000_000
+ cache_read * rates["cache_read"] / 1_000_000
+ cache_write * rates["cache_write"] / 1_000_000
)


def _parse_claude_usage(stdout: str, model: str) -> dict:
"""Parse ``claude -p --output-format json`` output for token/cost data.

Returns a dict with keys: input_tokens, output_tokens, tokens, cost_usd.
Returns a dict with keys: input_tokens, output_tokens, tokens,
cache_read_tokens, cache_write_tokens, cost_usd.
Returns zeroes on parse failure.
"""
result = {"input_tokens": 0, "output_tokens": 0, "tokens": 0, "cost_usd": 0.0}
result = {
"input_tokens": 0, "output_tokens": 0, "tokens": 0,
"cache_read_tokens": 0, "cache_write_tokens": 0, "cost_usd": 0.0,
}
if not stdout:
return result
try:
data = json.loads(stdout)
usage = data.get("usage", {})
inp = usage.get("input_tokens", 0)
out = usage.get("output_tokens", 0)
cache_read = usage.get("cache_read_input_tokens", 0)
cache_write = usage.get("cache_creation_input_tokens", 0)
result["input_tokens"] = inp
result["output_tokens"] = out
result["tokens"] = inp + out
result["cache_read_tokens"] = cache_read
result["cache_write_tokens"] = cache_write
# Use model from JSON if available, else fall back to caller's model
json_model = data.get("model", "")
result["cost_usd"] = _compute_cost(inp, out, json_model or model)
result["cost_usd"] = _compute_cost(
inp, out, json_model or model,
cache_read=cache_read, cache_write=cache_write,
)
except (json.JSONDecodeError, KeyError):
pass
return result
@@ -130,6 +152,7 @@ def _sum_jsonl_usage(paths: list[str]) -> dict:
"tokens": 0, "cost_usd": 0.0,
}
for path in paths:
seen_msg_ids: set[str] = set() # dedup per file
try:
with open(path, "r", encoding="utf-8", errors="ignore") as f:
for line in f:
@@ -140,6 +163,13 @@
entry = json.loads(line)
if entry.get("type") == "assistant":
message = entry.get("message", {})
# Deduplicate by message.id (streaming produces
# multiple JSONL lines per API response)
msg_id = message.get("id", "")
if msg_id:
if msg_id in seen_msg_ids:
continue
seen_msg_ids.add(msg_id)
usage = message.get("usage", {})
inp = usage.get("input_tokens", 0)
out = usage.get("output_tokens", 0)
@@ -1227,8 +1257,9 @@ def main(argv: list[str] | None = None) -> int:

def _cmd_run(args) -> int:
"""Execute the ``run`` subcommand."""
# Auto-append mode to output dir so raw/full don't overwrite each other
base_dir = Path(args.output_dir)
# Auto-append mode to output dir so raw/full don't overwrite each other.
# Resolve to absolute paths because full mode os.chdir()s into worktrees.
base_dir = Path(args.output_dir).resolve()
output_dir = base_dir / args.mode
# Share repo cache across modes (clones are expensive)
repos_dir = base_dir / "repos"
2 changes: 2 additions & 0 deletions python/agentize/usage.md
@@ -47,6 +47,8 @@ Dict mapping bucket keys to stats:
- Scans `~/.claude/projects/**/*.jsonl` files
- Filters by modification time (24h for today, 7d for week)
- Extracts `input_tokens` and `output_tokens` from assistant messages
- Deduplicates assistant entries that share the same `message.id` within a session file
- Cost estimation applies `cache_read`/`cache_write` pricing tiers when present
- Counts unique sessions (one JSONL file = one session)
- Returns empty buckets if `~/.claude/projects` doesn't exist
- Cache tokens: Extracts `cache_read_input_tokens` and `cache_creation_input_tokens` when `include_cache=True`
8 changes: 8 additions & 0 deletions python/agentize/usage.py
@@ -127,6 +127,7 @@ def make_bucket():

# Parse JSONL file line by line for memory efficiency
file_has_usage = False
seen_msg_ids: set[str] = set() # dedup per file
with open(jsonl_path, "r", encoding="utf-8", errors="ignore") as f:
for line in f:
line = line.strip()
@@ -137,6 +138,13 @@
# Extract usage from assistant messages
if entry.get("type") == "assistant":
message = entry.get("message", {})
# Deduplicate by message.id (streaming produces
# multiple JSONL lines per API response)
msg_id = message.get("id", "")
if msg_id:
if msg_id in seen_msg_ids:
continue
seen_msg_ids.add(msg_id)
usage = message.get("usage", {})
input_tokens = usage.get("input_tokens", 0)
output_tokens = usage.get("output_tokens", 0)
19 changes: 19 additions & 0 deletions python/tests/test_eval_harness.md
@@ -0,0 +1,19 @@
# test_eval_harness.py

Tests for `agentize.eval.eval_harness` pure functions.

## Scope

- `_compute_cost`: Cache-tier-aware pricing with `cache_read` and `cache_write` parameters
- `_parse_claude_usage`: Cache token extraction from `claude -p` JSON output
- `_sum_jsonl_usage`: Deduplication by `message.id` to avoid counting streamed content blocks
- `aggregate_metrics`: Cost aggregation across task results
- `extract_patch`: Git diff extraction from worktrees
- `write_overrides`: Shell override generation for eval isolation

## Test Data

Tests use inline JSON strings and `tmp_path` fixtures for JSONL files with:
- Duplicate `message.id` entries (simulating streaming content blocks)
- Cache token fields (`cache_read_input_tokens`, `cache_creation_input_tokens`)
- Mixed entries with and without `message.id`
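A minimal sketch of such a test, using a local reimplementation of the dedup loop rather than importing the real `_sum_jsonl_usage`:

```python
import json

def sum_usage_dedup(path):
    """Mirror of the harness's loop: count each assistant message.id once per file."""
    totals, seen = {"input_tokens": 0, "output_tokens": 0}, set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if entry.get("type") != "assistant":
                continue
            message = entry.get("message", {})
            msg_id = message.get("id", "")
            if msg_id:
                if msg_id in seen:
                    continue  # streamed duplicate of an already-counted response
                seen.add(msg_id)
            usage = message.get("usage", {})
            totals["input_tokens"] += usage.get("input_tokens", 0)
            totals["output_tokens"] += usage.get("output_tokens", 0)
    return totals

def test_duplicate_message_ids_counted_once(tmp_path):
    entries = [
        {"type": "assistant", "message": {"id": "msg_1", "usage": {"input_tokens": 100, "output_tokens": 10}}},
        {"type": "assistant", "message": {"id": "msg_1", "usage": {"input_tokens": 100, "output_tokens": 10}}},
        {"type": "assistant", "message": {"usage": {"input_tokens": 5, "output_tokens": 1}}},  # no id: counted
    ]
    p = tmp_path / "session.jsonl"
    p.write_text("\n".join(json.dumps(e) for e in entries))
    assert sum_usage_dedup(p) == {"input_tokens": 105, "output_tokens": 11}
```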