[plan][eval] Fix cost calculation with session-oriented dedup and cache-aware pricing #982

# Implementation Plan: Session-Oriented Cost Dedup and Cache-Aware Raw Cost

## Consensus Summary
Skill note: the external-consensus synthesis was done manually, because the external consensus script requires report file paths and write access to `.tmp`, neither of which is available in this read-only session. The plan combines the critique’s corrected dedup key (`message.id`) with the reducer’s minimal cache-aware raw cost fix. It keeps the bold proposal’s session-oriented dedup idea but drops the unnecessary Anthropic tokenizer dependency. This balances correctness (dedup plus cache tiers) against minimal scope, and it explicitly documents the limits so that the 133x ratio is not over-claimed.

## Goal
Make cost calculation session-oriented and consistent across raw and JSONL-based modes by deduplicating repeated assistant entries and applying cache-tier pricing in raw mode when cache token fields are present.

**Success criteria:**
- Raw mode cost uses the `cache_read`/`cache_write` tiers when they are available in the JSON output, matching the JSONL cost formula.
- JSONL usage aggregation deduplicates assistant entries by `message.id` within each session file.
- Tests cover deduplication and cache-aware cost; docs explain the behavior and limitations.

**Out of scope:**
- Integrating the Anthropic tokenizer for re-tokenization of JSONL sessions.
- Refreshing `MODEL_PRICING` without verified, current pricing sources.

**Future work decision:**
- ✅ Good to have in the future: Optional tokenizer-based re-counting for entries missing `usage` fields, gated behind an explicit flag and dependency note.

## Bug Reproduction

**Skip reason**:
- Reproduction requires real Claude Code JSONL session data and `claude` CLI execution; the current environment is read-only and lacks those artifacts.

## Codebase Analysis

**Files verified (docs/code checked by agents):**
- `python/agentize/eval/eval_harness.py` — `_compute_cost`, `_parse_claude_usage`, `_sum_jsonl_usage`, JSONL cost flow.
- `python/agentize/usage.py` — `count_usage`, cache tier pricing, JSONL parsing.
- `python/agentize/eval/eval_harness.md` — cost tracking documentation.
- `python/agentize/usage.md` — usage module interface docs.
- `docs/cli/lol.md` — `lol usage` behavior documentation.
- `docs/feat/core/ultra-planner.md` — `/ultra-planner` command reference for nlcmd templates.
- `docs/feat/core/mega-planner.md` — `/mega-planner` command reference for nlcmd templates.

**File changes:**

| File | Level | Purpose |
|------|-------|---------|
| `python/agentize/eval/eval_harness.py` | medium | Cache-aware raw cost, JSONL dedup by `message.id` |
| `python/agentize/usage.py` | medium | Session JSONL dedup by `message.id` in `count_usage` |
| `python/agentize/eval/eval_harness.md` | minor | Document raw cache tiers and JSONL dedup behavior |
| `python/agentize/usage.md` | minor | Document deduplication and cache-tier cost details |
| `docs/cli/lol.md` | minor | Clarify `lol usage` dedup and cache-aware cost estimate |
| `python/tests/test_eval_harness.py` | medium | Add cache-aware cost and JSONL dedup tests |
| `python/tests/test_eval_harness.md` (new) | major | Document test intent and fixtures (Est: 30 LOC) |
| `tests/cli/test-lol-usage.sh` | medium | Add dedup fixture validation |
| `tests/cli/test-lol-usage.md` (new) | major | Document CLI usage test expectations (Est: 25 LOC) |

**Modification level definitions:**
- **minor**: Cosmetic or trivial changes (comments, formatting, <10 LOC changed)
- **medium**: Moderate changes to existing logic (10-50 LOC, no interface changes)
- **major**: Significant structural changes (>50 LOC, interface changes, or new files)
- **remove**: File deletion

**Current architecture notes:**
Raw mode cost uses `_compute_cost` on JSON output that ignores cache tiers, while JSONL-based modes already split cache tiers in `_sum_jsonl_usage`. JSONL parsing in both `eval_harness.py` and `usage.py` counts every assistant line and does not deduplicate repeated `message.id` entries produced by streaming content blocks.
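
The per-file dedup described above can be sketched as follows. This is a minimal illustration, not the code in `eval_harness.py` or `usage.py`: the function name `sum_usage_dedup`, the entry shape (`type`, `message.id`, `message.usage`), and the choice to keep the first occurrence of each ID are all assumptions made for the example.

```python
import json
from typing import Iterable


def sum_usage_dedup(lines: Iterable[str]) -> dict:
    """Sum assistant token usage for ONE session file, counting each message.id once.

    Hypothetical helper: the JSONL entry shape used here is an assumption.
    """
    seen_ids = set()  # a fresh set per call keeps the dedup scoped to one file
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than failing the whole file
        if not isinstance(entry, dict) or entry.get("type") != "assistant":
            continue
        message = entry.get("message")
        if not isinstance(message, dict):
            continue
        msg_id = message.get("id")
        if msg_id is not None:
            if msg_id in seen_ids:
                continue  # repeated streamed block for an already-counted message
            seen_ids.add(msg_id)
        # Entries without an id fall through and are counted as-is (no dedup).
        usage = message.get("usage") or {}
        totals["input_tokens"] += usage.get("input_tokens", 0)
        totals["output_tokens"] += usage.get("output_tokens", 0)
    return totals
```

Calling the helper once per session file keeps the `seen_ids` set local to that file, so an ID that happens to repeat in a different file is still counted there.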

## Interface Design

**New interfaces:**
- None.

**Modified interfaces:**
```diff
-def _compute_cost(input_tokens: int, output_tokens: int, model: str) -> float:
+def _compute_cost(
+    input_tokens: int,
+    output_tokens: int,
+    model: str,
+    cache_read: int = 0,
+    cache_write: int = 0,
+) -> float:
```

```diff
-    return (
-        input_tokens * rates["input"] / 1_000_000
-        + output_tokens * rates["output"] / 1_000_000
-    )
+    non_cache = max(0, input_tokens - cache_read - cache_write)
+    return (
+        non_cache * rates["input"] / 1_000_000
+        + output_tokens * rates["output"] / 1_000_000
+        + cache_read * rates["cache_read"] / 1_000_000
+        + cache_write * rates["cache_write"] / 1_000_000
+    )
```
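
To make the arithmetic in the diff above concrete, here is a small standalone sketch of the same formula. The rate table `EXAMPLE_RATES` holds made-up per-million-token prices for illustration only (it is not `MODEL_PRICING`), and `compute_cost_sketch` is a hypothetical name. As in the diff, `input_tokens` is assumed to include the cache tokens, which is why they are subtracted before the base input rate is applied.

```python
# Hypothetical per-million-token rates, for illustration only; not real pricing.
EXAMPLE_RATES = {"input": 3.0, "output": 15.0, "cache_read": 0.3, "cache_write": 3.75}


def compute_cost_sketch(input_tokens, output_tokens, rates, cache_read=0, cache_write=0):
    """Cache-aware cost in USD, mirroring the formula in the diff above."""
    # Tokens billed at the base input rate are whatever is left after
    # removing the cache-read and cache-write portions (never negative).
    non_cache = max(0, input_tokens - cache_read - cache_write)
    return (
        non_cache * rates["input"]
        + output_tokens * rates["output"]
        + cache_read * rates["cache_read"]
        + cache_write * rates["cache_write"]
    ) / 1_000_000


flat = compute_cost_sketch(1_000_000, 100_000, EXAMPLE_RATES)
tiered = compute_cost_sketch(
    1_000_000, 100_000, EXAMPLE_RATES, cache_read=600_000, cache_write=100_000
)
```

With these illustrative rates, the flat estimate comes to about 4.5 and the tiered one to about 2.955, which shows how ignoring cache tiers can overstate cost when most input tokens are cache reads.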

```diff
-    result = {"input_tokens": 0, "output_tokens": 0, "tokens": 0, "cost_usd": 0.0}
+    result = {
+        "input_tokens": 0,
+        "output_tokens": 0,
+        "tokens": 0,
+        "cache_read_tokens": 0,
+        "cache_write_tokens": 0,
+        "cost_usd": 0.0,
+    }
```
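
One way the extended result dict could be filled from raw JSON output is sketched below. Several things here are assumptions rather than facts about `_parse_claude_usage`: the function name `parse_usage_sketch`, the presence of a top-level `usage` object, the field names `cache_read_input_tokens` and `cache_creation_input_tokens`, and the choice to report `input_tokens` as a total that includes cache tokens (chosen so that the `non_cache` subtraction in the cost formula stays non-negative). Missing cache fields default to 0.

```python
import json


def parse_usage_sketch(raw_json: str) -> dict:
    """Extract token counts from raw JSON output, defaulting cache fields to 0.

    Hypothetical sketch: the JSON layout and field names are assumptions.
    """
    result = {
        "input_tokens": 0,
        "output_tokens": 0,
        "tokens": 0,
        "cache_read_tokens": 0,
        "cache_write_tokens": 0,
        "cost_usd": 0.0,  # left at 0.0 here; cost is computed separately
    }
    try:
        data = json.loads(raw_json)
    except (json.JSONDecodeError, TypeError):
        return result
    usage = data.get("usage") if isinstance(data, dict) else None
    if not isinstance(usage, dict):
        return result

    cache_read = int(usage.get("cache_read_input_tokens") or 0)
    cache_write = int(usage.get("cache_creation_input_tokens") or 0)
    uncached = int(usage.get("input_tokens") or 0)
    output = int(usage.get("output_tokens") or 0)

    # Assumption: report input as the total across all three input tiers.
    result["input_tokens"] = uncached + cache_read + cache_write
    result["output_tokens"] = output
    result["tokens"] = result["input_tokens"] + output
    result["cache_read_tokens"] = cache_read
    result["cache_write_tokens"] = cache_write
    return result
```

If the real output already reports `input_tokens` as a total, the addition on the marked line would double-count and should be dropped; the sketch only illustrates the zero-default fallback.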

**Documentation changes:**
- `python/agentize/eval/eval_harness.md` — update Cost Estimation and Raw mode cost tracking behavior.
- `python/agentize/usage.md` — update `count_usage` behavior section to mention dedup and cache tiers.
- `docs/cli/lol.md` — update `lol usage` section to mention dedup and cache-tier cost estimate.
- `python/tests/test_eval_harness.md` — new test doc (follows `document-guideline`).
- `tests/cli/test-lol-usage.md` — new test doc (follows `document-guideline`).

## Documentation Planning

### High-level design docs (docs/)
- `docs/cli/lol.md` — update `lol usage` description to mention session-level dedup by `message.id` and cache-tier cost calculation.
```diff
- Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket.
+ Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket.
+ Assistant entries with the same `message.id` within a session are deduplicated to avoid double-counting streamed content blocks.
+ Cost estimates use cache-tier pricing when cache token fields are present.
```

### Folder READMEs
- None.

### Interface docs
- `python/agentize/eval/eval_harness.md` — clarify raw-mode cache-tier cost and JSONL dedup.
```diff
- | `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage |
+ | `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage (cache-tier aware when provided) |
```

```diff
- Cost is tracked via JSONL session file diffing — the same approach used
+ Cost is tracked via JSONL session file diffing — the same approach used,
+ with per-session deduplication by `message.id` to avoid streaming duplication.
```

- `python/agentize/usage.md` — update `count_usage` behavior to mention dedup and cache tiers.
```diff
- - Extracts `input_tokens` and `output_tokens` from assistant messages
+ - Extracts `input_tokens` and `output_tokens` from assistant messages
+ - Deduplicates assistant entries that share the same `message.id` within a session file
+ - Cost estimation applies cache_read/cache_write tiers when present
```

- `python/tests/test_eval_harness.md` — new doc summarizing test intent and fixtures (follow `document-guideline`).
```diff
+ # test_eval_harness.py
+ 
+ Design rationale: validate cache-tier cost and JSONL dedup logic for eval cost tracking.
+ 
+ Scope:
+ - _compute_cost cache-aware pricing
+ - _parse_claude_usage cache token extraction
+ - _sum_jsonl_usage dedup by message.id
```

- `tests/cli/test-lol-usage.md` — new doc describing CLI usage fixtures and expected totals (follow `document-guideline`).
```diff
+ # test-lol-usage.sh
+ 
+ Design rationale: ensure CLI usage aggregation deduplicates repeated assistant entries.
+ 
+ Fixtures:
+ - JSONL with duplicate message.id entries
+ - Expected totals reflect single-counted usage
```

## Test Strategy

**Test modifications:**
- `python/tests/test_eval_harness.py` — add tests for `_compute_cost` cache tiers, `_parse_claude_usage` cache fields, `_sum_jsonl_usage` dedup by `message.id`, and the non-dedup fallback when `message.id` is missing.
- `tests/cli/test-lol-usage.sh` — add a fixture with duplicate `message.id` and assert the `Total:` line shows single-counted input/output.

**Test data required:**
- Temporary JSONL fixtures with duplicate `message.id` and cache token fields.
- Use the existing temp HOME pattern in CLI tests.
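
A temporary fixture of the kind listed above could be built roughly like this. The helper name `write_dedup_fixture`, the file name `session.jsonl`, the entry shape, and the cache field names are assumptions for illustration; the real tests may structure their fixtures differently.

```python
import json
import tempfile
from pathlib import Path

# Totals a dedup-aware parser should report for the fixture below (msg_a counted once).
EXPECTED_SINGLE_COUNTED = {"input_tokens": 150, "output_tokens": 15}


def write_dedup_fixture(directory: Path) -> Path:
    """Write a JSONL session file in which one assistant message.id appears twice."""
    usage_a = {
        "input_tokens": 100,
        "output_tokens": 10,
        "cache_read_input_tokens": 40,      # assumed field name
        "cache_creation_input_tokens": 5,   # assumed field name
    }
    usage_b = {"input_tokens": 50, "output_tokens": 5}
    entries = [
        {"type": "assistant", "message": {"id": "msg_a", "usage": usage_a}},
        {"type": "assistant", "message": {"id": "msg_a", "usage": usage_a}},  # duplicate
        {"type": "assistant", "message": {"id": "msg_b", "usage": usage_b}},
    ]
    path = directory / "session.jsonl"
    path.write_text("\n".join(json.dumps(e) for e in entries) + "\n", encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    fixture = write_dedup_fixture(Path(tmp))
    rows = [json.loads(line) for line in fixture.read_text(encoding="utf-8").splitlines()]
    unique_ids = {row["message"]["id"] for row in rows}
```

The fixture has three lines but only two distinct IDs, so a parser that counts every line would report 250 input tokens instead of the single-counted 150.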

## Implementation Steps

**Step 1: Update documentation for cost dedup and cache tiers** (Estimated: 80 LOC)  
Files:
- `docs/cli/lol.md` — update `lol usage` section wording to mention dedup and cache-tier cost.
- `python/agentize/eval/eval_harness.md` — clarify raw-mode cache-tier cost and JSONL dedup behavior.
- `python/agentize/usage.md` — document dedup behavior and cache-tier pricing in `count_usage`.
- `python/tests/test_eval_harness.md` — create new test doc.
- `tests/cli/test-lol-usage.md` — create new test doc.  
Dependencies: None  
Correspondence:  
Docs: Defines dedup semantics and cache-tier cost usage.  
Tests: Establishes expected behavior for new test cases.

**Step 2: Add tests for dedup and cache-aware cost** (Estimated: 90 LOC)  
Files:
- `python/tests/test_eval_harness.py` — add cache-tier cost tests and JSONL dedup tests with `message.id`.
- `tests/cli/test-lol-usage.sh` — add duplicate `message.id` fixture and assert totals.  
Dependencies: Step 1  
Correspondence:  
Docs: Matches updated dedup/cost descriptions.  
Tests: Adds regression coverage for cost and dedup behavior.

**Step 3: Implement cache-aware raw cost and JSONL dedup** (Estimated: 90 LOC)  
Files:
- `python/agentize/eval/eval_harness.py` — update `_compute_cost` to accept cache tiers, update `_parse_claude_usage` to read cache fields, add per-file dedup by `message.id` in `_sum_jsonl_usage`.
- `python/agentize/usage.py` — add per-file dedup by `message.id` in `count_usage` parsing loop.  
Dependencies: Step 2  
Correspondence:  
Docs: Implements behaviors documented in `python/agentize/eval/eval_harness.md`, `python/agentize/usage.md`, `docs/cli/lol.md`.  
Tests: Satisfies new cache-aware and dedup test cases.

**Total estimated complexity:** 260 LOC (Large)  
**Recommended approach:** Single session

## Success Criteria

- [ ] Raw mode cost uses cache tiers when cache token fields are present in JSON output.
- [ ] JSONL parsing deduplicates assistant entries by `message.id` within a session file.
- [ ] Tests for dedup and cache-aware cost pass.

## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| `message.id` missing in some JSONL entries | M | M | Dedup only when `message.id` is present; document fallback behavior. |
| Raw JSON output lacks cache fields | M | L | Default cache values to 0; document that cache-tier cost applies only when fields exist. |
| Dedup undercounts if IDs repeat across distinct messages | L | M | Scope dedup per file and only for assistant entries; add test coverage for non-id lines. |

## Dependencies

- No new dependencies; avoid Anthropic tokenizer integration in this fix.
