diff --git a/docs/cli/lol.md b/docs/cli/lol.md
index 0fac391f..86537b50 100644
--- a/docs/cli/lol.md
+++ b/docs/cli/lol.md
@@ -108,7 +108,7 @@ Report Claude Code token usage statistics.
 lol usage [--today | --week]
 ```
 
-Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket.
+Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket. Assistant entries with the same `message.id` within a session are deduplicated to avoid double-counting streamed content blocks. Cost estimates use cache-tier pricing when cache token fields are present.
 
 #### Options
 
diff --git a/python/agentize/eval/eval-report-2026-03-16-swebench-20.md b/python/agentize/eval/eval-report-2026-03-16-swebench-20.md
new file mode 100644
index 00000000..a5973847
--- /dev/null
+++ b/python/agentize/eval/eval-report-2026-03-16-swebench-20.md
@@ -0,0 +1,175 @@
+# SWE-bench 20-Task Evaluation Report
+
+**Date:** 2026-03-16 (costs corrected 2026-03-24)
+**Benchmark:** SWE-bench Verified (20 astropy tasks)
+**Modes tested:** cc.r (raw), cc.nl (nlcmd), cc.script (full with Claude consensus)
+**Scoring:** SWE-bench Docker evaluator via podman
+
+## Executive Summary
+
+We scaled the SWE-bench evaluation from 5 to 20 tasks across three orchestration modes. The hypothesis **cc.r < cc.nl < cc.script** is **partially confirmed**: script orchestration (cc.script) resolves the most tasks at 60%, followed by NL orchestration (cc.nl) at 55%, and raw (cc.r) at 50%. Planning provides a consistent +5-10pp improvement — cc.script costs **~15x more** than cc.r for a 10pp gain.
+
+## Results
+
+### Resolve Rates
+
+| Mode |  name | Resolved | Rate | Cost | Time |
+|------|-------------|----------|------|------|------|
+| `--mode raw` | **cc.r** | 10/20 | 50% | $4.64 | 34 min |
+| `--mode nlcmd` | **cc.nl** | 11/20 | 55% | ~$8.94* | ~52 min* |
+| `--mode full` | **cc.script** | 12/20 | 60% | $70.62 | 3.9 hrs |
+
+*\*cc.nl cost/time estimated from 6 measured tasks (resume overwrite lost first 14 tasks' metrics).*
+
+> **Cost correction note (2026-03-24):** Original costs ($1.47 cc.r, $195.30 cc.script) were inaccurate due to two bugs: (1) raw mode ignored cache-tier pricing, undercounting by ~3x; (2) JSONL-based modes double-counted streaming content blocks, overcounting by ~1.7x. See [#982](https://github.com/Synthesys-Lab/agentize/issues/982). Re-run with fixed cost tracking on 2026-03-24.
+
+### Per-Instance Results
+
+| Instance | cc.r | cc.nl | cc.script |
+|----------|------|-------|-----------|
+| astropy-12907 | PASS | PASS | PASS |
+| astropy-13033 | FAIL | FAIL | FAIL |
+| astropy-13236 | FAIL | FAIL | FAIL |
+| astropy-13398 | FAIL | FAIL | FAIL |
+| astropy-13453 | PASS | PASS | PASS |
+| astropy-13579 | PASS | PASS | PASS |
+| astropy-13977 | FAIL | FAIL | FAIL |
+| astropy-14096 | PASS | PASS | PASS |
+| astropy-14182 | FAIL | FAIL | FAIL |
+| astropy-14309 | PASS | PASS | PASS |
+| **astropy-14365** | FAIL | FAIL | **PASS** |
+| astropy-14369 | FAIL | FAIL | FAIL |
+| astropy-14508 | PASS | PASS | PASS |
+| astropy-14539 | PASS | PASS | PASS |
+| astropy-14598 | FAIL | FAIL | FAIL |
+| astropy-14995 | PASS | PASS | PASS |
+| **astropy-7166** | FAIL | **PASS** | **PASS** |
+| astropy-7336 | PASS | PASS | PASS |
+| astropy-7606 | FAIL | FAIL | FAIL |
+| astropy-7671 | PASS | PASS | PASS |
+
+### Token Usage & Timing
+
+| Mode | Tokens total | Tokens mean | Time total | Time mean |
+|------|-------------|-------------|------------|-----------|
+| **cc.r** | 102,658 | 5,133 | 2,034s (34 min) | 102s |
+| **cc.nl** | ~123,763* | ~6,188* | ~3,134s* (~52 min) | ~157s* |
+| **cc.script** | 152,796 | 7,640 | 14,241s (3.9 hrs) | 712s |
+
+## Analysis
+
+### Finding 1: cc.r < cc.nl < cc.script confirmed (barely)
+
+The ordering holds but the margins are thin:
+
+| Comparison | Resolve delta | Cost delta |
+|---|---|---|
+| cc.nl vs cc.r | +1 task (+5pp) | ~2x more ($8.94 vs $4.64) |
+| cc.script vs cc.nl | +1 task (+5pp) | ~8x more ($70.62 vs $8.94) |
+| cc.script vs cc.r | +2 tasks (+10pp) | **~15x more** ($70.62 vs $4.64) |
+
+Each step up in orchestration complexity solves exactly one additional task. The cost scaling is superlinear — diminishing returns at increasing cost, though the ~15x ratio is much more reasonable than originally reported.
+
+### Finding 2: A core of 10 tasks is "easy" — all modes solve them
+
+All three modes agree on 10 tasks (50%): astropy-12907, 13453, 13579, 14096, 14309, 14508, 14539, 14995, 7336, 7671. These are likely straightforward bugs where even raw `claude -p` can produce the right fix.
+
+### Finding 3: A core of 8 tasks is "hard" — no mode solves them
+
+All three modes fail on 8 tasks (40%): astropy-13033, 13236, 13398, 13977, 14182, 14369, 14598, 7606. These are beyond what current orchestration can address, regardless of planning sophistication.
+
+### Finding 4: The interesting zone is 2 tasks (10%)
+
+Only 2 tasks differentiate the modes:
+
+| Task | cc.r | cc.nl | cc.script | What helps |
+|------|------|-------|-----------|------------|
+| **astropy-7166** | FAIL | PASS | PASS | Any planning helps |
+| **astropy-14365** | FAIL | FAIL | PASS | Only script planning helps |
+
+- **astropy-7166**: Both planning modes solve it, raw doesn't. Planning provides enough context to find the right approach.
+- **astropy-14365**: Only the 5-agent debate (cc.script) solves it. The NL planning in cc.nl is insufficient — the structured critique/reducer/consensus pipeline catches something the single-shot NL planner misses.
+
+### Finding 5: cc.nl is the best cost-performance tradeoff
+
+| Mode | $/resolved task | Marginal cost per extra task |
+|------|----------------|------------------------------|
+| **cc.r** | $0.46 | — |
+| **cc.nl** | $0.81 | ~$4.30 for +1 task |
+| **cc.script** | $5.89 | ~$61.68 for +1 task over cc.nl |
+
+cc.nl gets 55% resolve rate at ~$9 total — 8x cheaper than cc.script for only 1 fewer resolved task. The marginal cost of cc.script's extra task (astropy-14365) is ~$62.
+
+### Finding 6: Scaling from 5 to 20 tasks changed the picture
+
+| Mode | 5-task rate | 20-task rate | Delta |
+|------|-----------|-------------|-------|
+| cc.r | 20% (1/5) | 50% (10/20) | +30pp |
+| cc.script | 40% (2/5) | 60% (12/20) | +20pp |
+
+The 5-task sample was pessimistic — the first 5 astropy tasks happened to be harder than average. At 20 tasks, raw mode's 50% baseline is much stronger, and the relative advantage of planning shrinks from +20pp to +10pp.
+
+## Limitations
+
+1. **Single repository** — All 20 tasks are from astropy. Results may not generalize to other Python projects or languages.
+2. **cc.nl metrics estimated** — Resume bug overwrote first 14 tasks' cost/time data. Cost extrapolated from 6 measured tasks.
+3. **codex.script not completed** — Codex API rate limits prevented completing the 20-task run (only 3/20 finished). Cannot compare cc.script vs codex.script at this scale.
+4. **Single run** — No repeated trials to measure variance. The cc.r and cc.script re-runs (2026-03-24) are on newer model versions and may differ in resolve rates.
+5. **No impl mode** — `--mode impl` was not run at 20-task scale for this report.
+6. **cc.nl not re-run** — cc.nl costs were not re-measured with the corrected cost tracking; the ~$8.94 figure remains an estimate from the original run.
+
+## Recommendations
+
+1. **Use cc.nl as default** — best cost-performance ratio (55% at ~$9 vs 60% at $71).
+2. **Use cc.script for high-stakes tasks** — when correctness matters more than cost, the extra ~$62/task buys +5pp.
+3. **Complete codex.script evaluation** — needed to test the hypothesis cc.script ≈ codex.script.
+4. **Investigate the 8 hard-fail tasks** — root-cause analysis may reveal systematic pipeline gaps.
+5. **Expand to other repositories** — astropy-only evaluation limits generalizability.
+
+## Appendix A: Cost Efficiency
+
+| Mode | Cost/task | Time/task | $/second | Tokens/task |
+|------|-----------|-----------|----------|-------------|
+| **cc.r** | $0.23 | 102s | $0.0023 | 5,133 |
+| **cc.nl** | ~$0.45 | ~157s | ~$0.003 | ~6,188 |
+| **cc.script** | $3.53 | 712s | $0.005 | 7,640 |
+
+## Appendix B: Model-Level Cost Breakdown (2026-03-25)
+
+### Raw Opus Baseline
+
+Running raw mode with opus instead of sonnet validates cost proportionality:
+
+| Mode | Model | Tokens | Cost | Cost/task | Time |
+|------|-------|--------|------|-----------|------|
+| `--mode raw` | sonnet | 102,658 | $4.64 | $0.23 | 34 min |
+| `--mode raw` | opus | 66,139 | $27.32 | $1.37 | 49 min |
+| `--mode full` | opus plan + sonnet impl | 152,796 | $70.62 | $3.53 | 3.9 hrs |
+
+**Key insight:** Raw opus costs $27.32 — the 5.9x ratio vs raw sonnet ($4.64) matches the ~5x opus/sonnet pricing difference. Full mode's $70.62 breaks down as: raw opus baseline ($27) + planning pipeline overhead (~$43). The planning overhead is ~1.6x the raw opus cost, not the 15x it appears when comparing against raw sonnet.
+
+### Codex Implementation Backend — Iteration Loop Bug
+
+Running full mode with `--planner-backend claude:opus --impl-backend codex:gpt-5.2-codex` revealed a critical iteration bug. On task 1 (astropy-12907), the Codex implementation agent failed to produce the FSM completion marker, causing the harness to loop:
+
+```
+Iteration 1: impl-iter-1 (codex:gpt-5.2-codex) runs 83s → "completion marker missing"
+Iteration 2: impl-iter-2 (codex:gpt-5.2-codex) runs 158s → "completion marker missing"
+Iteration 3: impl-iter-3 (codex:gpt-5.2-codex) runs 81s  → "completion marker missing"
+Iteration 4: impl-iter-4 (codex:gpt-5.2-codex) runs 193s → "completion marker missing"
+Iteration 5: (killed by user)
+```
+
+The Codex backend produced the correct patch in iteration 1 but never emitted the completion signal. Each subsequent iteration found "no changes to commit" yet still burned ~100-200s. This multi-iteration loop is the likely cause of unexpectedly long run times when using Codex as the implementation backend. The sonnet implementation backend consistently completes in exactly 1 iteration across all 20 tasks.
+
+### nlcmd Mode — Planning Timeout Analysis (2026-03-25)
+
+Re-running nlcmd with corrected cost tracking revealed that 7/20 tasks (35%) consistently timeout during Phase 1 (NL planning via `claude -p "/ultra-planner"`), even across two attempts with the 900s planning budget:
+
+| Attempt | Completed | Timed out | Cost |
+|---------|-----------|-----------|------|
+| Run 1 (20 tasks) | 10 | 10 | $101.22 |
+| Run 2 (10 retried) | 3 | 7 | $57.85 |
+| **Combined** | **13** | **7** | **$159.07** |
+
+The nlcmd mode is paradoxically the most expensive ($159 total) due to timeout waste — the timed-out planning phases still consume tokens and cost without producing patches. The 7 persistently-failing tasks (13236, 13398, 13977, 14369, 14508, 14598, 7671) suggest the `claude -p` NL dispatch adds overhead that pushes the 5-agent debate past the 900s budget.
diff --git a/python/agentize/eval/eval_harness.md b/python/agentize/eval/eval_harness.md
index 2fda3cbd..bb112749 100644
--- a/python/agentize/eval/eval_harness.md
+++ b/python/agentize/eval/eval_harness.md
@@ -33,7 +33,7 @@ The harness supports four execution modes via `--mode`:
 
 | Mode | What runs | What it tests | Cost tracking |
 |------|-----------|---------------|---------------|
-| `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage |
+| `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage (cache-tier aware when provided) |
 | `impl` | FSM orchestrator only (no planning) | The impl kernel loop | JSONL session files |
 | `full` | Planning pipeline + FSM orchestrator | The agentize framework | JSONL session files |
 | `nlcmd` | NL planning via `claude -p` + FSM | NL orchestration | JSONL session files |
@@ -154,6 +154,11 @@ Run costs scale with task count and per-task timeout. Validate incrementally:
 
 ## Cost Estimation
 
+For JSONL-based modes (`impl`, `full`, `nlcmd`), cost is tracked via session file
+diffing with per-session deduplication by `message.id` to avoid counting streamed
+content blocks multiple times. Raw mode uses `claude -p` JSON output with
+cache-tier-aware pricing when cache token fields are present.
+
 Per-task costs depend on the model and task complexity. Rough estimates:
 
 | Model | Tokens/task (est.) | Cost/task (est.) | 300 tasks |
diff --git a/python/agentize/eval/eval_harness.py b/python/agentize/eval/eval_harness.py
index 5f42c3c4..1eb85c8f 100644
--- a/python/agentize/eval/eval_harness.py
+++ b/python/agentize/eval/eval_harness.py
@@ -61,10 +61,18 @@ def _make_result(instance_id: str) -> dict:
     }
 
 
-def _compute_cost(input_tokens: int, output_tokens: int, model: str) -> float:
+def _compute_cost(
+    input_tokens: int,
+    output_tokens: int,
+    model: str,
+    cache_read: int = 0,
+    cache_write: int = 0,
+) -> float:
     """Compute estimated USD cost from token counts and model short-name.
 
-    Falls back to 0.0 if the model is not in the pricing table.
+    When cache_read/cache_write are provided, applies cache-tier pricing:
+    non-cache input at base rate, cache reads at cache_read rate, cache
+    writes at cache_write rate. Falls back to 0.0 if model is unknown.
     """
     from agentize.usage import match_model_pricing
 
@@ -72,19 +80,26 @@ def _compute_cost(input_tokens: int, output_tokens: int, model: str) -> float:
     rates = match_model_pricing(model_id)
     if not rates:
         return 0.0
+    non_cache = max(0, input_tokens - cache_read - cache_write)
     return (
-        input_tokens * rates["input"] / 1_000_000
+        non_cache * rates["input"] / 1_000_000
         + output_tokens * rates["output"] / 1_000_000
+        + cache_read * rates["cache_read"] / 1_000_000
+        + cache_write * rates["cache_write"] / 1_000_000
     )
 
 
 def _parse_claude_usage(stdout: str, model: str) -> dict:
     """Parse ``claude -p --output-format json`` output for token/cost data.
 
-    Returns a dict with keys: input_tokens, output_tokens, tokens, cost_usd.
+    Returns a dict with keys: input_tokens, output_tokens, tokens,
+    cache_read_tokens, cache_write_tokens, cost_usd.
     Returns zeroes on parse failure.
     """
-    result = {"input_tokens": 0, "output_tokens": 0, "tokens": 0, "cost_usd": 0.0}
+    result = {
+        "input_tokens": 0, "output_tokens": 0, "tokens": 0,
+        "cache_read_tokens": 0, "cache_write_tokens": 0, "cost_usd": 0.0,
+    }
     if not stdout:
         return result
     try:
@@ -92,12 +107,19 @@ def _parse_claude_usage(stdout: str, model: str) -> dict:
         usage = data.get("usage", {})
         inp = usage.get("input_tokens", 0)
         out = usage.get("output_tokens", 0)
+        cache_read = usage.get("cache_read_input_tokens", 0)
+        cache_write = usage.get("cache_creation_input_tokens", 0)
         result["input_tokens"] = inp
         result["output_tokens"] = out
         result["tokens"] = inp + out
+        result["cache_read_tokens"] = cache_read
+        result["cache_write_tokens"] = cache_write
         # Use model from JSON if available, else fall back to caller's model
         json_model = data.get("model", "")
-        result["cost_usd"] = _compute_cost(inp, out, json_model or model)
+        result["cost_usd"] = _compute_cost(
+            inp, out, json_model or model,
+            cache_read=cache_read, cache_write=cache_write,
+        )
     except (json.JSONDecodeError, KeyError):
         pass
     return result
@@ -130,6 +152,7 @@ def _sum_jsonl_usage(paths: list[str]) -> dict:
         "tokens": 0, "cost_usd": 0.0,
     }
     for path in paths:
+        seen_msg_ids: set[str] = set()  # dedup per file
         try:
             with open(path, "r", encoding="utf-8", errors="ignore") as f:
                 for line in f:
@@ -140,6 +163,13 @@ def _sum_jsonl_usage(paths: list[str]) -> dict:
                         entry = json.loads(line)
                         if entry.get("type") == "assistant":
                             message = entry.get("message", {})
+                            # Deduplicate by message.id (streaming produces
+                            # multiple JSONL lines per API response)
+                            msg_id = message.get("id", "")
+                            if msg_id:
+                                if msg_id in seen_msg_ids:
+                                    continue
+                                seen_msg_ids.add(msg_id)
                             usage = message.get("usage", {})
                             inp = usage.get("input_tokens", 0)
                             out = usage.get("output_tokens", 0)
@@ -1227,8 +1257,9 @@ def main(argv: list[str] | None = None) -> int:
 
 def _cmd_run(args) -> int:
     """Execute the ``run`` subcommand."""
-    # Auto-append mode to output dir so raw/full don't overwrite each other
-    base_dir = Path(args.output_dir)
+    # Auto-append mode to output dir so raw/full don't overwrite each other.
+    # Resolve to absolute paths because full mode os.chdir()s into worktrees.
+    base_dir = Path(args.output_dir).resolve()
     output_dir = base_dir / args.mode
     # Share repo cache across modes (clones are expensive)
     repos_dir = base_dir / "repos"
diff --git a/python/agentize/usage.md b/python/agentize/usage.md
index be31a904..b0c1cf2d 100644
--- a/python/agentize/usage.md
+++ b/python/agentize/usage.md
@@ -47,6 +47,8 @@ Dict mapping bucket keys to stats:
 - Scans `~/.claude/projects/**/*.jsonl` files
 - Filters by modification time (24h for today, 7d for week)
 - Extracts `input_tokens` and `output_tokens` from assistant messages
+- Deduplicates assistant entries that share the same `message.id` within a session file
+- Cost estimation applies cache_read/cache_write tiers when present
 - Counts unique sessions (one JSONL file = one session)
 - Returns empty buckets if `~/.claude/projects` doesn't exist
 - Cache tokens: Extracts `cache_read_input_tokens` and `cache_creation_input_tokens` when `include_cache=True`
diff --git a/python/agentize/usage.py b/python/agentize/usage.py
index 55277c26..9197ac67 100644
--- a/python/agentize/usage.py
+++ b/python/agentize/usage.py
@@ -127,6 +127,7 @@ def make_bucket():
 
             # Parse JSONL file line by line for memory efficiency
             file_has_usage = False
+            seen_msg_ids: set[str] = set()  # dedup per file
             with open(jsonl_path, "r", encoding="utf-8", errors="ignore") as f:
                 for line in f:
                     line = line.strip()
@@ -137,6 +138,13 @@ def make_bucket():
                         # Extract usage from assistant messages
                         if entry.get("type") == "assistant":
                             message = entry.get("message", {})
+                            # Deduplicate by message.id (streaming produces
+                            # multiple JSONL lines per API response)
+                            msg_id = message.get("id", "")
+                            if msg_id:
+                                if msg_id in seen_msg_ids:
+                                    continue
+                                seen_msg_ids.add(msg_id)
                             usage = message.get("usage", {})
                             input_tokens = usage.get("input_tokens", 0)
                             output_tokens = usage.get("output_tokens", 0)
diff --git a/python/tests/test_eval_harness.md b/python/tests/test_eval_harness.md
new file mode 100644
index 00000000..a39fb7d5
--- /dev/null
+++ b/python/tests/test_eval_harness.md
@@ -0,0 +1,19 @@
+# test_eval_harness.py
+
+Tests for `agentize.eval.eval_harness` pure functions.
+
+## Scope
+
+- `_compute_cost`: Cache-tier-aware pricing with `cache_read` and `cache_write` parameters
+- `_parse_claude_usage`: Cache token extraction from `claude -p` JSON output
+- `_sum_jsonl_usage`: Deduplication by `message.id` to avoid counting streamed content blocks
+- `aggregate_metrics`: Cost aggregation across task results
+- `extract_patch`: Git diff extraction from worktrees
+- `write_overrides`: Shell override generation for eval isolation
+
+## Test Data
+
+Tests use inline JSON strings and `tmp_path` fixtures for JSONL files with:
+- Duplicate `message.id` entries (simulating streaming content blocks)
+- Cache token fields (`cache_read_input_tokens`, `cache_creation_input_tokens`)
+- Mixed entries with and without `message.id`
diff --git a/python/tests/test_eval_harness.py b/python/tests/test_eval_harness.py
index f4b9bf4b..2d76e28e 100644
--- a/python/tests/test_eval_harness.py
+++ b/python/tests/test_eval_harness.py
@@ -365,6 +365,51 @@ def test_uses_json_model_for_cost(self):
         )
         assert usage_opus["cost_usd"] > usage_sonnet["cost_usd"]
 
+    def test_extracts_cache_tokens(self):
+        """Should extract cache_read and cache_write token fields."""
+        stdout = json.dumps({
+            "model": "claude-sonnet-4-20260514",
+            "usage": {
+                "input_tokens": 1000,
+                "output_tokens": 500,
+                "cache_read_input_tokens": 800,
+                "cache_creation_input_tokens": 100,
+            },
+        })
+        usage = _parse_claude_usage(stdout, "sonnet")
+        assert usage["cache_read_tokens"] == 800
+        assert usage["cache_write_tokens"] == 100
+
+    def test_cache_tokens_default_zero(self):
+        """Cache tokens should default to 0 when not present."""
+        stdout = json.dumps({
+            "model": "claude-sonnet-4-20260514",
+            "usage": {"input_tokens": 1000, "output_tokens": 500},
+        })
+        usage = _parse_claude_usage(stdout, "sonnet")
+        assert usage["cache_read_tokens"] == 0
+        assert usage["cache_write_tokens"] == 0
+
+    def test_cache_aware_cost(self):
+        """Cost should use cache tiers when cache tokens are present."""
+        stdout_no_cache = json.dumps({
+            "model": "claude-sonnet-4-20260514",
+            "usage": {"input_tokens": 1000, "output_tokens": 500},
+        })
+        stdout_with_cache = json.dumps({
+            "model": "claude-sonnet-4-20260514",
+            "usage": {
+                "input_tokens": 1000,
+                "output_tokens": 500,
+                "cache_read_input_tokens": 900,
+                "cache_creation_input_tokens": 0,
+            },
+        })
+        cost_no_cache = _parse_claude_usage(stdout_no_cache, "sonnet")["cost_usd"]
+        cost_with_cache = _parse_claude_usage(stdout_with_cache, "sonnet")["cost_usd"]
+        # With 900/1000 tokens as cache reads (10x cheaper), cost should be lower
+        assert cost_with_cache < cost_no_cache
+
 
 class TestComputeCost:
     def test_sonnet_cost(self):
@@ -390,6 +435,40 @@ def test_full_model_id(self):
         cost = _compute_cost(1_000_000, 0, "claude-sonnet-4-20260514")
         assert cost == pytest.approx(3.0)
 
+    def test_cache_read_reduces_cost(self):
+        """Cache reads at 0.1x should be much cheaper than base input."""
+        # 1M input, 0 output, all cache reads — sonnet cache_read = $0.30/M
+        cost_cached = _compute_cost(1_000_000, 0, "sonnet",
+                                    cache_read=1_000_000, cache_write=0)
+        # All tokens are cache reads → non_cache = 0, cost = cache_read rate only
+        assert cost_cached == pytest.approx(0.30)
+
+    def test_cache_write_cost(self):
+        """Cache writes at 1.25x should cost more than base input."""
+        # 1M input, 0 output, all cache writes — sonnet cache_write = $3.75/M
+        cost = _compute_cost(1_000_000, 0, "sonnet",
+                             cache_read=0, cache_write=1_000_000)
+        assert cost == pytest.approx(3.75)
+
+    def test_mixed_cache_tiers(self):
+        """Mixed cache tiers should split cost correctly."""
+        # 1M total input: 500K cache_read, 200K cache_write, 300K base
+        # Sonnet: base=$3/M, cache_read=$0.30/M, cache_write=$3.75/M, output=$15/M
+        cost = _compute_cost(1_000_000, 100_000, "sonnet",
+                             cache_read=500_000, cache_write=200_000)
+        expected = (
+            300_000 * 3.0 / 1_000_000      # base input
+            + 100_000 * 15.0 / 1_000_000   # output
+            + 500_000 * 0.30 / 1_000_000   # cache_read
+            + 200_000 * 3.75 / 1_000_000   # cache_write
+        )
+        assert cost == pytest.approx(expected)
+
+    def test_no_cache_params_backward_compat(self):
+        """Without cache params, should behave like before (all input at base rate)."""
+        cost = _compute_cost(1_000_000, 0, "sonnet")
+        assert cost == pytest.approx(3.0)
+
 
 class TestAggregateMetricsCost:
     def test_cost_aggregation(self):
@@ -560,3 +639,84 @@ def _mock_list_jsonl():
         assert result["cache_write_tokens"] == 20
         assert result["tokens"] == 300
         assert result["cost_usd"] == 1.50
+
+
+# ---------------------------------------------------------------------------
+# JSONL deduplication tests
+# ---------------------------------------------------------------------------
+
+
+class TestSumJsonlUsageDedup:
+    def test_dedup_by_message_id(self, tmp_path):
+        """Duplicate entries with same message.id should be counted once."""
+        jsonl = tmp_path / "session.jsonl"
+        lines = [
+            json.dumps({"type": "assistant", "message": {
+                "id": "msg_001", "model": "claude-sonnet-4-20260514",
+                "usage": {"input_tokens": 100, "output_tokens": 50,
+                          "cache_read_input_tokens": 80,
+                          "cache_creation_input_tokens": 0}}}),
+            # Duplicate — same message.id, different content block
+            json.dumps({"type": "assistant", "message": {
+                "id": "msg_001", "model": "claude-sonnet-4-20260514",
+                "usage": {"input_tokens": 100, "output_tokens": 50,
+                          "cache_read_input_tokens": 80,
+                          "cache_creation_input_tokens": 0}}}),
+        ]
+        jsonl.write_text("\n".join(lines) + "\n")
+
+        result = _sum_jsonl_usage([str(jsonl)])
+        # Should count only once despite two lines
+        assert result["input_tokens"] == 100
+        assert result["output_tokens"] == 50
+        assert result["cache_read"] == 80
+
+    def test_different_message_ids_both_counted(self, tmp_path):
+        """Entries with different message.id should both be counted."""
+        jsonl = tmp_path / "session.jsonl"
+        lines = [
+            json.dumps({"type": "assistant", "message": {
+                "id": "msg_001", "model": "claude-sonnet-4-20260514",
+                "usage": {"input_tokens": 100, "output_tokens": 50}}}),
+            json.dumps({"type": "assistant", "message": {
+                "id": "msg_002", "model": "claude-sonnet-4-20260514",
+                "usage": {"input_tokens": 200, "output_tokens": 75}}}),
+        ]
+        jsonl.write_text("\n".join(lines) + "\n")
+
+        result = _sum_jsonl_usage([str(jsonl)])
+        assert result["input_tokens"] == 300
+        assert result["output_tokens"] == 125
+
+    def test_no_message_id_still_counted(self, tmp_path):
+        """Entries without message.id should still be counted (no dedup)."""
+        jsonl = tmp_path / "session.jsonl"
+        lines = [
+            json.dumps({"type": "assistant", "message": {
+                "model": "claude-sonnet-4-20260514",
+                "usage": {"input_tokens": 100, "output_tokens": 50}}}),
+            json.dumps({"type": "assistant", "message": {
+                "model": "claude-sonnet-4-20260514",
+                "usage": {"input_tokens": 200, "output_tokens": 75}}}),
+        ]
+        jsonl.write_text("\n".join(lines) + "\n")
+
+        result = _sum_jsonl_usage([str(jsonl)])
+        # No message.id → no dedup → both counted
+        assert result["input_tokens"] == 300
+        assert result["output_tokens"] == 125
+
+    def test_dedup_scoped_per_file(self, tmp_path):
+        """Dedup sets should reset between files (same msg ID in different files = counted twice)."""
+        jsonl1 = tmp_path / "session1.jsonl"
+        jsonl2 = tmp_path / "session2.jsonl"
+        line = json.dumps({"type": "assistant", "message": {
+            "id": "msg_001", "model": "claude-sonnet-4-20260514",
+            "usage": {"input_tokens": 100, "output_tokens": 50}}})
+        jsonl1.write_text(line + "\n")
+        jsonl2.write_text(line + "\n")
+
+        result = _sum_jsonl_usage([str(jsonl1), str(jsonl2)])
+        # Same msg ID but different files → counted in each file
+        assert result["input_tokens"] == 200
+        assert result["output_tokens"] == 100
diff --git a/tests/cli/test-lol-usage.sh b/tests/cli/test-lol-usage.sh
index da86ba07..eb8da409 100755
--- a/tests/cli/test-lol-usage.sh
+++ b/tests/cli/test-lol-usage.sh
@@ -240,4 +240,39 @@ echo "$output" | grep -q '\$' || {
 
 cleanup_dir "$TEST_HOME"
 
+# Test 9: lol usage deduplicates entries with same message.id
+TEST_HOME=$(make_temp_dir "usage-dedup")
+PROJECTS_DIR="$TEST_HOME/.claude/projects"
+FIXTURE_DIR="$PROJECTS_DIR/test-project"
+mkdir -p "$FIXTURE_DIR"
+
+# Create fixture with duplicate message.id entries (simulates streaming content blocks)
+# Each entry has 1000 input, 500 output — but same message.id means count once
+cat > "$FIXTURE_DIR/session.jsonl" << 'EOF'
+{"type":"assistant","message":{"id":"msg_001","model":"claude-3-5-sonnet-20241022","usage":{"input_tokens":1000,"output_tokens":500,"cache_read_input_tokens":800,"cache_creation_input_tokens":0}}}
+{"type":"assistant","message":{"id":"msg_001","model":"claude-3-5-sonnet-20241022","usage":{"input_tokens":1000,"output_tokens":500,"cache_read_input_tokens":800,"cache_creation_input_tokens":0}}}
+{"type":"assistant","message":{"id":"msg_002","model":"claude-3-5-sonnet-20241022","usage":{"input_tokens":2000,"output_tokens":1000,"cache_read_input_tokens":1500,"cache_creation_input_tokens":0}}}
+EOF
+
+touch "$FIXTURE_DIR/session.jsonl"
+
+output=$(HOME="$TEST_HOME" lol usage --today 2>&1)
+exit_code=$?
+if [ $exit_code -ne 0 ]; then
+  echo "Output: $output"
+  cleanup_dir "$TEST_HOME"
+  test_fail "lol usage dedup test exited with code $exit_code"
+fi
+
+# Total should show 3.0K input (1000+2000, not 1000+1000+2000) after dedup
+# Check that Total line shows 3.0K input (deduplicated), not 4.0K (without dedup)
+total_line=$(echo "$output" | grep "Total:")
+if echo "$total_line" | grep -q "4.0K input"; then
+  echo "Output: $output"
+  cleanup_dir "$TEST_HOME"
+  test_fail "lol usage is NOT deduplicating entries with same message.id (showing 4.0K instead of 3.0K)"
+fi
+
+cleanup_dir "$TEST_HOME"
+
 test_pass "lol usage command works correctly"