diff --git a/docs/cli/lol.md b/docs/cli/lol.md index 0fac391f..86537b50 100644 --- a/docs/cli/lol.md +++ b/docs/cli/lol.md @@ -108,7 +108,7 @@ Report Claude Code token usage statistics. lol usage [--today | --week] ``` -Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket. +Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket. Assistant entries with the same `message.id` within a session are deduplicated to avoid double-counting streamed content blocks. Cost estimates use cache-tier pricing when cache token fields are present. #### Options diff --git a/python/agentize/eval/eval-report-2026-03-16-swebench-20.md b/python/agentize/eval/eval-report-2026-03-16-swebench-20.md new file mode 100644 index 00000000..a5973847 --- /dev/null +++ b/python/agentize/eval/eval-report-2026-03-16-swebench-20.md @@ -0,0 +1,175 @@ +# SWE-bench 20-Task Evaluation Report + +**Date:** 2026-03-16 (costs corrected 2026-03-24) +**Benchmark:** SWE-bench Verified (20 astropy tasks) +**Modes tested:** cc.r (raw), cc.nl (nlcmd), cc.script (full with Claude consensus) +**Scoring:** SWE-bench Docker evaluator via podman + +## Executive Summary + +We scaled the SWE-bench evaluation from 5 to 20 tasks across three orchestration modes. The hypothesis **cc.r < cc.nl < cc.script** is **partially confirmed**: script orchestration (cc.script) resolves the most tasks at 60%, followed by NL orchestration (cc.nl) at 55%, and raw (cc.r) at 50%. Planning provides a consistent +5-10pp improvement — cc.script costs **~15x more** than cc.r for a 10pp gain. + +## Results + +### Resolve Rates + +| Mode | name | Resolved | Rate | Cost | Time | +|------|-------------|----------|------|------|------| +| `--mode raw` | **cc.r** | 10/20 | 50% | $4.64 | 34 min | +| `--mode nlcmd` | **cc.nl** | 11/20 | 55% | ~$8.94* | ~52 min* | +| `--mode full` | **cc.script** | 12/20 | 60% | $70.62 | 3.9 hrs | + +*\*cc.nl cost/time estimated from 6 measured tasks (resume overwrite lost first 14 tasks' metrics).* + +> **Cost correction note (2026-03-24):** Original costs ($1.47 cc.r, $195.30 cc.script) were inaccurate due to two bugs: (1) raw mode ignored cache-tier pricing, undercounting by ~3x; (2) JSONL-based modes double-counted streaming content blocks, overcounting by ~1.7x. See [#982](https://github.com/Synthesys-Lab/agentize/issues/982). Re-run with fixed cost tracking on 2026-03-24. + +### Per-Instance Results + +| Instance | cc.r | cc.nl | cc.script | +|----------|------|-------|-----------| +| astropy-12907 | PASS | PASS | PASS | +| astropy-13033 | FAIL | FAIL | FAIL | +| astropy-13236 | FAIL | FAIL | FAIL | +| astropy-13398 | FAIL | FAIL | FAIL | +| astropy-13453 | PASS | PASS | PASS | +| astropy-13579 | PASS | PASS | PASS | +| astropy-13977 | FAIL | FAIL | FAIL | +| astropy-14096 | PASS | PASS | PASS | +| astropy-14182 | FAIL | FAIL | FAIL | +| astropy-14309 | PASS | PASS | PASS | +| **astropy-14365** | FAIL | FAIL | **PASS** | +| astropy-14369 | FAIL | FAIL | FAIL | +| astropy-14508 | PASS | PASS | PASS | +| astropy-14539 | PASS | PASS | PASS | +| astropy-14598 | FAIL | FAIL | FAIL | +| astropy-14995 | PASS | PASS | PASS | +| **astropy-7166** | FAIL | **PASS** | **PASS** | +| astropy-7336 | PASS | PASS | PASS | +| astropy-7606 | FAIL | FAIL | FAIL | +| astropy-7671 | PASS | PASS | PASS | + +### Token Usage & Timing + +| Mode | Tokens total | Tokens mean | Time total | Time mean | +|------|-------------|-------------|------------|-----------| +| **cc.r** | 102,658 | 5,133 | 2,034s (34 min) | 102s | +| **cc.nl** | ~123,763* | ~6,188* | ~3,134s* (~52 min) | ~157s* | +| **cc.script** | 152,796 | 7,640 | 14,241s (3.9 hrs) | 712s | + +## Analysis + +### Finding 1: cc.r < cc.nl < cc.script confirmed (barely) + +The ordering holds but the margins are thin: + +| Comparison | Resolve delta | Cost delta | +|---|---|---| +| cc.nl vs cc.r | +1 task (+5pp) | ~2x more ($8.94 vs $4.64) | +| cc.script vs cc.nl | +1 task (+5pp) | ~8x more ($70.62 vs $8.94) | +| cc.script vs cc.r | +2 tasks (+10pp) | **~15x more** ($70.62 vs $4.64) | + +Each step up in orchestration complexity solves exactly one additional task. The cost scaling is superlinear — diminishing returns at increasing cost, though the ~15x ratio is much more reasonable than originally reported. + +### Finding 2: A core of 10 tasks is "easy" — all modes solve them + +All three modes agree on 10 tasks (50%): astropy-12907, 13453, 13579, 14096, 14309, 14508, 14539, 14995, 7336, 7671. These are likely straightforward bugs where even raw `claude -p` can produce the right fix. + +### Finding 3: A core of 8 tasks is "hard" — no mode solves them + +All three modes fail on 8 tasks (40%): astropy-13033, 13236, 13398, 13977, 14182, 14369, 14598, 7606. These are beyond what current orchestration can address, regardless of planning sophistication. + +### Finding 4: The interesting zone is 2 tasks (10%) + +Only 2 tasks differentiate the modes: + +| Task | cc.r | cc.nl | cc.script | What helps | +|------|------|-------|-----------|------------| +| **astropy-7166** | FAIL | PASS | PASS | Any planning helps | +| **astropy-14365** | FAIL | FAIL | PASS | Only script planning helps | + +- **astropy-7166**: Both planning modes solve it, raw doesn't. Planning provides enough context to find the right approach. +- **astropy-14365**: Only the 5-agent debate (cc.script) solves it. The NL planning in cc.nl is insufficient — the structured critique/reducer/consensus pipeline catches something the single-shot NL planner misses. + +### Finding 5: cc.nl is the best cost-performance tradeoff + +| Mode | $/resolved task | Marginal cost per extra task | +|------|----------------|------------------------------| +| **cc.r** | $0.46 | — | +| **cc.nl** | $0.81 | ~$4.30 for +1 task | +| **cc.script** | $5.89 | ~$61.68 for +1 task over cc.nl | + +cc.nl gets 55% resolve rate at ~$9 total — 8x cheaper than cc.script for only 1 fewer resolved task. The marginal cost of cc.script's extra task (astropy-14365) is ~$62. + +### Finding 6: Scaling from 5 to 20 tasks changed the picture + +| Mode | 5-task rate | 20-task rate | Delta | +|------|-----------|-------------|-------| +| cc.r | 20% (1/5) | 50% (10/20) | +30pp | +| cc.script | 40% (2/5) | 60% (12/20) | +20pp | + +The 5-task sample was pessimistic — the first 5 astropy tasks happened to be harder than average. At 20 tasks, raw mode's 50% baseline is much stronger, and the relative advantage of planning shrinks from +20pp to +10pp. + +## Limitations + +1. **Single repository** — All 20 tasks are from astropy. Results may not generalize to other Python projects or languages. +2. **cc.nl metrics estimated** — Resume bug overwrote first 14 tasks' cost/time data. Cost extrapolated from 6 measured tasks. +3. **codex.script not completed** — Codex API rate limits prevented completing the 20-task run (only 3/20 finished). Cannot compare cc.script vs codex.script at this scale. +4. **Single run** — No repeated trials to measure variance. The cc.r and cc.script re-runs (2026-03-24) are on newer model versions and may differ in resolve rates. +5. **No impl mode** — `--mode impl` was not run at 20-task scale for this report. +6. **cc.nl not re-run** — cc.nl costs were not re-measured with the corrected cost tracking; the ~$8.94 figure remains an estimate from the original run. + +## Recommendations + +1. **Use cc.nl as default** — best cost-performance ratio (55% at ~$9 vs 60% at $71). +2. **Use cc.script for high-stakes tasks** — when correctness matters more than cost, the extra ~$62/task buys +5pp. +3. **Complete codex.script evaluation** — needed to test the hypothesis cc.script ≈ codex.script. +4. **Investigate the 8 hard-fail tasks** — root-cause analysis may reveal systematic pipeline gaps. +5. **Expand to other repositories** — astropy-only evaluation limits generalizability. + +## Appendix A: Cost Efficiency + +| Mode | Cost/task | Time/task | $/second | Tokens/task | +|------|-----------|-----------|----------|-------------| +| **cc.r** | $0.23 | 102s | $0.0023 | 5,133 | +| **cc.nl** | ~$0.45 | ~157s | ~$0.003 | ~6,188 | +| **cc.script** | $3.53 | 712s | $0.005 | 7,640 | + +## Appendix B: Model-Level Cost Breakdown (2026-03-25) + +### Raw Opus Baseline + +Running raw mode with opus instead of sonnet validates cost proportionality: + +| Mode | Model | Tokens | Cost | Cost/task | Time | +|------|-------|--------|------|-----------|------| +| `--mode raw` | sonnet | 102,658 | $4.64 | $0.23 | 34 min | +| `--mode raw` | opus | 66,139 | $27.32 | $1.37 | 49 min | +| `--mode full` | opus plan + sonnet impl | 152,796 | $70.62 | $3.53 | 3.9 hrs | + +**Key insight:** Raw opus costs $27.32 — the 5.9x ratio vs raw sonnet ($4.64) matches the ~5x opus/sonnet pricing difference. Full mode's $70.62 breaks down as: raw opus baseline ($27) + planning pipeline overhead (~$43). The planning overhead is ~1.6x the raw opus cost, not the 15x it appears when comparing against raw sonnet. + +### Codex Implementation Backend — Iteration Loop Bug + +Running full mode with `--planner-backend claude:opus --impl-backend codex:gpt-5.2-codex` revealed a critical iteration bug. On task 1 (astropy-12907), the Codex implementation agent failed to produce the FSM completion marker, causing the harness to loop: + +``` +Iteration 1: impl-iter-1 (codex:gpt-5.2-codex) runs 83s → "completion marker missing" +Iteration 2: impl-iter-2 (codex:gpt-5.2-codex) runs 158s → "completion marker missing" +Iteration 3: impl-iter-3 (codex:gpt-5.2-codex) runs 81s → "completion marker missing" +Iteration 4: impl-iter-4 (codex:gpt-5.2-codex) runs 193s → "completion marker missing" +Iteration 5: (killed by user) +``` + +The Codex backend produced the correct patch in iteration 1 but never emitted the completion signal. Each subsequent iteration found "no changes to commit" yet still burned ~100-200s. This multi-iteration loop is the likely cause of unexpectedly long run times when using Codex as the implementation backend. The sonnet implementation backend consistently completes in exactly 1 iteration across all 20 tasks. + +### nlcmd Mode — Planning Timeout Analysis (2026-03-25) + +Re-running nlcmd with corrected cost tracking revealed that 7/20 tasks (35%) consistently timeout during Phase 1 (NL planning via `claude -p "/ultra-planner"`), even across two attempts with the 900s planning budget: + +| Attempt | Completed | Timed out | Cost | +|---------|-----------|-----------|------| +| Run 1 (20 tasks) | 10 | 10 | $101.22 | +| Run 2 (10 retried) | 3 | 7 | $57.85 | +| **Combined** | **13** | **7** | **$159.07** | + +The nlcmd mode is paradoxically the most expensive ($159 total) due to timeout waste — the timed-out planning phases still consume tokens and cost without producing patches. The 7 persistently-failing tasks (13236, 13398, 13977, 14369, 14508, 14598, 7671) suggest the `claude -p` NL dispatch adds overhead that pushes the 5-agent debate past the 900s budget. diff --git a/python/agentize/eval/eval_harness.md b/python/agentize/eval/eval_harness.md index 2fda3cbd..bb112749 100644 --- a/python/agentize/eval/eval_harness.md +++ b/python/agentize/eval/eval_harness.md @@ -33,7 +33,7 @@ The harness supports four execution modes via `--mode`: | Mode | What runs | What it tests | Cost tracking | |------|-----------|---------------|---------------| -| `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage | +| `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage (cache-tier aware when provided) | | `impl` | FSM orchestrator only (no planning) | The impl kernel loop | JSONL session files | | `full` | Planning pipeline + FSM orchestrator | The agentize framework | JSONL session files | | `nlcmd` | NL planning via `claude -p` + FSM | NL orchestration | JSONL session files | @@ -154,6 +154,11 @@ Run costs scale with task count and per-task timeout. Validate incrementally: ## Cost Estimation +For JSONL-based modes (`impl`, `full`, `nlcmd`), cost is tracked via session file +diffing with per-session deduplication by `message.id` to avoid counting streamed +content blocks multiple times. Raw mode uses `claude -p` JSON output with +cache-tier-aware pricing when cache token fields are present. + Per-task costs depend on the model and task complexity. Rough estimates: | Model | Tokens/task (est.) | Cost/task (est.) | 300 tasks | diff --git a/python/agentize/eval/eval_harness.py b/python/agentize/eval/eval_harness.py index 5f42c3c4..1eb85c8f 100644 --- a/python/agentize/eval/eval_harness.py +++ b/python/agentize/eval/eval_harness.py @@ -61,10 +61,18 @@ def _make_result(instance_id: str) -> dict: } -def _compute_cost(input_tokens: int, output_tokens: int, model: str) -> float: +def _compute_cost( + input_tokens: int, + output_tokens: int, + model: str, + cache_read: int = 0, + cache_write: int = 0, +) -> float: """Compute estimated USD cost from token counts and model short-name. - Falls back to 0.0 if the model is not in the pricing table. + When cache_read/cache_write are provided, applies cache-tier pricing: + non-cache input at base rate, cache reads at cache_read rate, cache + writes at cache_write rate. Falls back to 0.0 if model is unknown. """ from agentize.usage import match_model_pricing @@ -72,19 +80,26 @@ def _compute_cost(input_tokens: int, output_tokens: int, model: str) -> float: rates = match_model_pricing(model_id) if not rates: return 0.0 + non_cache = max(0, input_tokens - cache_read - cache_write) return ( - input_tokens * rates["input"] / 1_000_000 + non_cache * rates["input"] / 1_000_000 + output_tokens * rates["output"] / 1_000_000 + + cache_read * rates["cache_read"] / 1_000_000 + + cache_write * rates["cache_write"] / 1_000_000 ) def _parse_claude_usage(stdout: str, model: str) -> dict: """Parse ``claude -p --output-format json`` output for token/cost data. - Returns a dict with keys: input_tokens, output_tokens, tokens, cost_usd. + Returns a dict with keys: input_tokens, output_tokens, tokens, + cache_read_tokens, cache_write_tokens, cost_usd. Returns zeroes on parse failure. """ - result = {"input_tokens": 0, "output_tokens": 0, "tokens": 0, "cost_usd": 0.0} + result = { + "input_tokens": 0, "output_tokens": 0, "tokens": 0, + "cache_read_tokens": 0, "cache_write_tokens": 0, "cost_usd": 0.0, + } if not stdout: return result try: @@ -92,12 +107,19 @@ def _parse_claude_usage(stdout: str, model: str) -> dict: usage = data.get("usage", {}) inp = usage.get("input_tokens", 0) out = usage.get("output_tokens", 0) + cache_read = usage.get("cache_read_input_tokens", 0) + cache_write = usage.get("cache_creation_input_tokens", 0) result["input_tokens"] = inp result["output_tokens"] = out result["tokens"] = inp + out + result["cache_read_tokens"] = cache_read + result["cache_write_tokens"] = cache_write # Use model from JSON if available, else fall back to caller's model json_model = data.get("model", "") - result["cost_usd"] = _compute_cost(inp, out, json_model or model) + result["cost_usd"] = _compute_cost( + inp, out, json_model or model, + cache_read=cache_read, cache_write=cache_write, + ) except (json.JSONDecodeError, KeyError): pass return result @@ -130,6 +152,7 @@ def _sum_jsonl_usage(paths: list[str]) -> dict: "tokens": 0, "cost_usd": 0.0, } for path in paths: + seen_msg_ids: set[str] = set() # dedup per file try: with open(path, "r", encoding="utf-8", errors="ignore") as f: for line in f: @@ -140,6 +163,13 @@ def _sum_jsonl_usage(paths: list[str]) -> dict: entry = json.loads(line) if entry.get("type") == "assistant": message = entry.get("message", {}) + # Deduplicate by message.id (streaming produces + # multiple JSONL lines per API response) + msg_id = message.get("id", "") + if msg_id: + if msg_id in seen_msg_ids: + continue + seen_msg_ids.add(msg_id) usage = message.get("usage", {}) inp = usage.get("input_tokens", 0) out = usage.get("output_tokens", 0) @@ -1227,8 +1257,9 @@ def main(argv: list[str] | None = None) -> int: def _cmd_run(args) -> int: """Execute the ``run`` subcommand.""" - # Auto-append mode to output dir so raw/full don't overwrite each other - base_dir = Path(args.output_dir) + # Auto-append mode to output dir so raw/full don't overwrite each other. + # Resolve to absolute paths because full mode os.chdir()s into worktrees. + base_dir = Path(args.output_dir).resolve() output_dir = base_dir / args.mode # Share repo cache across modes (clones are expensive) repos_dir = base_dir / "repos" diff --git a/python/agentize/usage.md b/python/agentize/usage.md index be31a904..b0c1cf2d 100644 --- a/python/agentize/usage.md +++ b/python/agentize/usage.md @@ -47,6 +47,8 @@ Dict mapping bucket keys to stats: - Scans `~/.claude/projects/**/*.jsonl` files - Filters by modification time (24h for today, 7d for week) - Extracts `input_tokens` and `output_tokens` from assistant messages +- Deduplicates assistant entries that share the same `message.id` within a session file +- Cost estimation applies cache_read/cache_write tiers when present - Counts unique sessions (one JSONL file = one session) - Returns empty buckets if `~/.claude/projects` doesn't exist - Cache tokens: Extracts `cache_read_input_tokens` and `cache_creation_input_tokens` when `include_cache=True` diff --git a/python/agentize/usage.py b/python/agentize/usage.py index 55277c26..9197ac67 100644 --- a/python/agentize/usage.py +++ b/python/agentize/usage.py @@ -127,6 +127,7 @@ def make_bucket(): # Parse JSONL file line by line for memory efficiency file_has_usage = False + seen_msg_ids: set[str] = set() # dedup per file with open(jsonl_path, "r", encoding="utf-8", errors="ignore") as f: for line in f: line = line.strip() @@ -137,6 +138,13 @@ def make_bucket(): # Extract usage from assistant messages if entry.get("type") == "assistant": message = entry.get("message", {}) + # Deduplicate by message.id (streaming produces + # multiple JSONL lines per API response) + msg_id = message.get("id", "") + if msg_id: + if msg_id in seen_msg_ids: + continue + seen_msg_ids.add(msg_id) usage = message.get("usage", {}) input_tokens = usage.get("input_tokens", 0) output_tokens = usage.get("output_tokens", 0) diff --git a/python/tests/test_eval_harness.md b/python/tests/test_eval_harness.md new file mode 100644 index 00000000..a39fb7d5 --- /dev/null +++ b/python/tests/test_eval_harness.md @@ -0,0 +1,19 @@ +# test_eval_harness.py + +Tests for `agentize.eval.eval_harness` pure functions. + +## Scope + +- `_compute_cost`: Cache-tier-aware pricing with `cache_read` and `cache_write` parameters +- `_parse_claude_usage`: Cache token extraction from `claude -p` JSON output +- `_sum_jsonl_usage`: Deduplication by `message.id` to avoid counting streamed content blocks +- `aggregate_metrics`: Cost aggregation across task results +- `extract_patch`: Git diff extraction from worktrees +- `write_overrides`: Shell override generation for eval isolation + +## Test Data + +Tests use inline JSON strings and `tmp_path` fixtures for JSONL files with: +- Duplicate `message.id` entries (simulating streaming content blocks) +- Cache token fields (`cache_read_input_tokens`, `cache_creation_input_tokens`) +- Mixed entries with and without `message.id` diff --git a/python/tests/test_eval_harness.py b/python/tests/test_eval_harness.py index f4b9bf4b..2d76e28e 100644 --- a/python/tests/test_eval_harness.py +++ b/python/tests/test_eval_harness.py @@ -365,6 +365,51 @@ def test_uses_json_model_for_cost(self): ) assert usage_opus["cost_usd"] > usage_sonnet["cost_usd"] + def test_extracts_cache_tokens(self): + """Should extract cache_read and cache_write token fields.""" + stdout = json.dumps({ + "model": "claude-sonnet-4-20260514", + "usage": { + "input_tokens": 1000, + "output_tokens": 500, + "cache_read_input_tokens": 800, + "cache_creation_input_tokens": 100, + }, + }) + usage = _parse_claude_usage(stdout, "sonnet") + assert usage["cache_read_tokens"] == 800 + assert usage["cache_write_tokens"] == 100 + + def test_cache_tokens_default_zero(self): + """Cache tokens should default to 0 when not present.""" + stdout = json.dumps({ + "model": "claude-sonnet-4-20260514", + "usage": {"input_tokens": 1000, "output_tokens": 500}, + }) + usage = _parse_claude_usage(stdout, "sonnet") + assert usage["cache_read_tokens"] == 0 + assert usage["cache_write_tokens"] == 0 + + def test_cache_aware_cost(self): + """Cost should use cache tiers when cache tokens are present.""" + stdout_no_cache = json.dumps({ + "model": "claude-sonnet-4-20260514", + "usage": {"input_tokens": 1000, "output_tokens": 500}, + }) + stdout_with_cache = json.dumps({ + "model": "claude-sonnet-4-20260514", + "usage": { + "input_tokens": 1000, + "output_tokens": 500, + "cache_read_input_tokens": 900, + "cache_creation_input_tokens": 0, + }, + }) + cost_no_cache = _parse_claude_usage(stdout_no_cache, "sonnet")["cost_usd"] + cost_with_cache = _parse_claude_usage(stdout_with_cache, "sonnet")["cost_usd"] + # With 900/1000 tokens as cache reads (10x cheaper), cost should be lower + assert cost_with_cache < cost_no_cache + class TestComputeCost: def test_sonnet_cost(self): @@ -390,6 +435,40 @@ def test_full_model_id(self): cost = _compute_cost(1_000_000, 0, "claude-sonnet-4-20260514") assert cost == pytest.approx(3.0) + def test_cache_read_reduces_cost(self): + """Cache reads at 0.1x should be much cheaper than base input.""" + # 1M input, 0 output, all cache reads — sonnet cache_read = $0.30/M + cost_cached = _compute_cost(1_000_000, 0, "sonnet", + cache_read=1_000_000, cache_write=0) + # All tokens are cache reads → non_cache = 0, cost = cache_read rate only + assert cost_cached == pytest.approx(0.30) + + def test_cache_write_cost(self): + """Cache writes at 1.25x should cost more than base input.""" + # 1M input, 0 output, all cache writes — sonnet cache_write = $3.75/M + cost = _compute_cost(1_000_000, 0, "sonnet", + cache_read=0, cache_write=1_000_000) + assert cost == pytest.approx(3.75) + + def test_mixed_cache_tiers(self): + """Mixed cache tiers should split cost correctly.""" + # 1M total input: 500K cache_read, 200K cache_write, 300K base + # Sonnet: base=$3/M, cache_read=$0.30/M, cache_write=$3.75/M, output=$15/M + cost = _compute_cost(1_000_000, 100_000, "sonnet", + cache_read=500_000, cache_write=200_000) + expected = ( + 300_000 * 3.0 / 1_000_000 # base input + + 100_000 * 15.0 / 1_000_000 # output + + 500_000 * 0.30 / 1_000_000 # cache_read + + 200_000 * 3.75 / 1_000_000 # cache_write + ) + assert cost == pytest.approx(expected) + + def test_no_cache_params_backward_compat(self): + """Without cache params, should behave like before (all input at base rate).""" + cost = _compute_cost(1_000_000, 0, "sonnet") + assert cost == pytest.approx(3.0) + class TestAggregateMetricsCost: def test_cost_aggregation(self): @@ -560,3 +639,84 @@ def _mock_list_jsonl(): assert result["cache_write_tokens"] == 20 assert result["tokens"] == 300 assert result["cost_usd"] == 1.50 + + +# --------------------------------------------------------------------------- +# JSONL deduplication tests +# --------------------------------------------------------------------------- + + +class TestSumJsonlUsageDedup: + def test_dedup_by_message_id(self, tmp_path): + """Duplicate entries with same message.id should be counted once.""" + jsonl = tmp_path / "session.jsonl" + lines = [ + json.dumps({"type": "assistant", "message": { + "id": "msg_001", "model": "claude-sonnet-4-20260514", + "usage": {"input_tokens": 100, "output_tokens": 50, + "cache_read_input_tokens": 80, + "cache_creation_input_tokens": 0}}}), + # Duplicate — same message.id, different content block + json.dumps({"type": "assistant", "message": { + "id": "msg_001", "model": "claude-sonnet-4-20260514", + "usage": {"input_tokens": 100, "output_tokens": 50, + "cache_read_input_tokens": 80, + "cache_creation_input_tokens": 0}}}), + ] + jsonl.write_text("\n".join(lines) + "\n") + + result = _sum_jsonl_usage([str(jsonl)]) + # Should count only once despite two lines + assert result["input_tokens"] == 100 + assert result["output_tokens"] == 50 + assert result["cache_read"] == 80 + + def test_different_message_ids_both_counted(self, tmp_path): + """Entries with different message.id should both be counted.""" + jsonl = tmp_path / "session.jsonl" + lines = [ + json.dumps({"type": "assistant", "message": { + "id": "msg_001", "model": "claude-sonnet-4-20260514", + "usage": {"input_tokens": 100, "output_tokens": 50}}}), + json.dumps({"type": "assistant", "message": { + "id": "msg_002", "model": "claude-sonnet-4-20260514", + "usage": {"input_tokens": 200, "output_tokens": 75}}}), + ] + jsonl.write_text("\n".join(lines) + "\n") + + result = _sum_jsonl_usage([str(jsonl)]) + assert result["input_tokens"] == 300 + assert result["output_tokens"] == 125 + + def test_no_message_id_still_counted(self, tmp_path): + """Entries without message.id should still be counted (no dedup).""" + jsonl = tmp_path / "session.jsonl" + lines = [ + json.dumps({"type": "assistant", "message": { + "model": "claude-sonnet-4-20260514", + "usage": {"input_tokens": 100, "output_tokens": 50}}}), + json.dumps({"type": "assistant", "message": { + "model": "claude-sonnet-4-20260514", + "usage": {"input_tokens": 200, "output_tokens": 75}}}), + ] + jsonl.write_text("\n".join(lines) + "\n") + + result = _sum_jsonl_usage([str(jsonl)]) + # No message.id → no dedup → both counted + assert result["input_tokens"] == 300 + assert result["output_tokens"] == 125 + + def test_dedup_scoped_per_file(self, tmp_path): + """Dedup sets should reset between files (same msg ID in different files = counted twice).""" + jsonl1 = tmp_path / "session1.jsonl" + jsonl2 = tmp_path / "session2.jsonl" + line = json.dumps({"type": "assistant", "message": { + "id": "msg_001", "model": "claude-sonnet-4-20260514", + "usage": {"input_tokens": 100, "output_tokens": 50}}}) + jsonl1.write_text(line + "\n") + jsonl2.write_text(line + "\n") + + result = _sum_jsonl_usage([str(jsonl1), str(jsonl2)]) + # Same msg ID but different files → counted in each file + assert result["input_tokens"] == 200 + assert result["output_tokens"] == 100 diff --git a/tests/cli/test-lol-usage.sh b/tests/cli/test-lol-usage.sh index da86ba07..eb8da409 100755 --- a/tests/cli/test-lol-usage.sh +++ b/tests/cli/test-lol-usage.sh @@ -240,4 +240,39 @@ echo "$output" | grep -q '\$' || { cleanup_dir "$TEST_HOME" +# Test 9: lol usage deduplicates entries with same message.id +TEST_HOME=$(make_temp_dir "usage-dedup") +PROJECTS_DIR="$TEST_HOME/.claude/projects" +FIXTURE_DIR="$PROJECTS_DIR/test-project" +mkdir -p "$FIXTURE_DIR" + +# Create fixture with duplicate message.id entries (simulates streaming content blocks) +# Each entry has 1000 input, 500 output — but same message.id means count once +cat > "$FIXTURE_DIR/session.jsonl" << 'EOF' +{"type":"assistant","message":{"id":"msg_001","model":"claude-3-5-sonnet-20241022","usage":{"input_tokens":1000,"output_tokens":500,"cache_read_input_tokens":800,"cache_creation_input_tokens":0}}} +{"type":"assistant","message":{"id":"msg_001","model":"claude-3-5-sonnet-20241022","usage":{"input_tokens":1000,"output_tokens":500,"cache_read_input_tokens":800,"cache_creation_input_tokens":0}}} +{"type":"assistant","message":{"id":"msg_002","model":"claude-3-5-sonnet-20241022","usage":{"input_tokens":2000,"output_tokens":1000,"cache_read_input_tokens":1500,"cache_creation_input_tokens":0}}} +EOF + +touch "$FIXTURE_DIR/session.jsonl" + +output=$(HOME="$TEST_HOME" lol usage --today 2>&1) +exit_code=$? +if [ $exit_code -ne 0 ]; then + echo "Output: $output" + cleanup_dir "$TEST_HOME" + test_fail "lol usage dedup test exited with code $exit_code" +fi + +# Total should show 3.0K input (1000+2000, not 1000+1000+2000) after dedup +# Check that Total line shows 3.0K input (deduplicated), not 4.0K (without dedup) +total_line=$(echo "$output" | grep "Total:") +if echo "$total_line" | grep -q "4.0K input"; then + echo "Output: $output" + cleanup_dir "$TEST_HOME" + test_fail "lol usage is NOT deduplicating entries with same message.id (showing 4.0K instead of 3.0K)" +fi + +cleanup_dir "$TEST_HOME" + test_pass "lol usage command works correctly"