[plan][eval] Fix cost calculation with session-oriented dedup and cache-aware pricing #982

@ayazhankadessova

Implementation Plan: Session-Oriented Cost Dedup and Cache-Aware Raw Cost

Consensus Summary

Skill note: external-consensus synthesis is manual because the external consensus script requires report file paths and write access to .tmp, which are unavailable in this read-only session. The plan combines the critique’s corrected dedup key (message.id) with the reducer’s minimal cache-aware raw cost fix, while keeping the bold proposal’s session-oriented dedup idea but avoiding the unnecessary Anthropic tokenizer dependency. This balances correctness (dedup + cache tiers) with minimal scope and explicitly documents limits so the 133x ratio is not over-claimed.

Goal

Make cost calculation session-oriented and consistent across raw and JSONL-based modes by deduplicating repeated assistant entries and applying cache-tier pricing in raw mode when cache token fields are present.

Success criteria:

  • Raw mode cost uses cache_read/cache_write tiers when available in JSON output, matching JSONL cost formula.
  • JSONL usage aggregation deduplicates assistant entries by message.id within each session file.
  • Tests cover deduplication and cache-aware cost; docs explain the behavior and limitations.

Out of scope:

  • Integrating the Anthropic tokenizer for re-tokenization of JSONL sessions.
  • Refreshing MODEL_PRICING without verified, current pricing sources.

Future work decision:

  • ✅ Good to have in the future: Optional tokenizer-based re-counting for entries missing usage fields, gated behind an explicit flag and dependency note.

Bug Reproduction

Skip reason:

  • Reproduction requires real Claude Code JSONL session data and claude CLI execution; the current environment is read-only and lacks those artifacts.

Codebase Analysis

Files verified (docs/code checked by agents):

  • python/agentize/eval/eval_harness.py: _compute_cost, _parse_claude_usage, _sum_jsonl_usage, JSONL cost flow.
  • python/agentize/usage.py: count_usage, cache tier pricing, JSONL parsing.
  • python/agentize/eval/eval_harness.md: cost tracking documentation.
  • python/agentize/usage.md: usage module interface docs.
  • docs/cli/lol.md: lol usage behavior documentation.
  • docs/feat/core/ultra-planner.md: /ultra-planner command reference for nlcmd templates.
  • docs/feat/core/mega-planner.md: /mega-planner command reference for nlcmd templates.

File changes:

| File | Level | Purpose |
|------|-------|---------|
| python/agentize/eval/eval_harness.py | medium | Cache-aware raw cost, JSONL dedup by message.id |
| python/agentize/usage.py | medium | Session JSONL dedup by message.id in count_usage |
| python/agentize/eval/eval_harness.md | minor | Document raw cache tiers and JSONL dedup behavior |
| python/agentize/usage.md | minor | Document deduplication and cache-tier cost details |
| docs/cli/lol.md | minor | Clarify lol usage dedup and cache-aware cost estimate |
| python/tests/test_eval_harness.py | medium | Add cache-aware cost and JSONL dedup tests |
| python/tests/test_eval_harness.md (new) | medium | Document test intent and fixtures (Est: 30 LOC) |
| tests/cli/test-lol-usage.sh | medium | Add dedup fixture validation |
| tests/cli/test-lol-usage.md (new) | medium | Document CLI usage test expectations (Est: 25 LOC) |

Modification level definitions:

  • minor: Cosmetic or trivial changes (comments, formatting, <10 LOC changed)
  • medium: Moderate changes to existing logic (10-50 LOC, no interface changes)
  • major: Significant structural changes (>50 LOC, interface changes, or new files)
  • remove: File deletion

Current architecture notes:
Raw mode cost uses _compute_cost on JSON output that ignores cache tiers, while JSONL-based modes already split cache tiers in _sum_jsonl_usage. JSONL parsing in both eval_harness.py and usage.py counts every assistant line and does not deduplicate repeated message.id entries produced by streaming content blocks.
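
The per-file dedup described above could look roughly like the sketch below. It is illustrative only: the function name `sum_usage_dedup` is hypothetical, and the assumed line shape (`type == "assistant"`, with `message.id` and `message.usage`) is an assumption about the session JSONL format, not a copy of the existing `_sum_jsonl_usage` code. Entries without a `message.id` are counted as-is, matching the fallback in the risk table.

```python
import json
from typing import Iterable


def sum_usage_dedup(lines: Iterable[str]) -> dict:
    """Sum usage over one session file, counting each message.id once.

    Hypothetical helper; the JSONL entry shape is an assumption.
    """
    totals = {"input_tokens": 0, "output_tokens": 0}
    seen_ids = set()  # scoped to a single file, so IDs never collide across sessions
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole file
        if not isinstance(entry, dict) or entry.get("type") != "assistant":
            continue
        message = entry.get("message")
        if not isinstance(message, dict):
            continue
        msg_id = message.get("id")
        if msg_id is not None:
            if msg_id in seen_ids:
                continue  # repeated streamed block of an already-counted message
            seen_ids.add(msg_id)
        # Entries without an id fall through and are always counted.
        usage = message.get("usage") or {}
        totals["input_tokens"] += int(usage.get("input_tokens") or 0)
        totals["output_tokens"] += int(usage.get("output_tokens") or 0)
    return totals
```

Keeping `seen_ids` local to one call is what scopes the dedup to a single session file.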

Interface Design

New interfaces:

  • None.

Modified interfaces:

-def _compute_cost(input_tokens: int, output_tokens: int, model: str) -> float:
+def _compute_cost(
+    input_tokens: int,
+    output_tokens: int,
+    model: str,
+    cache_read: int = 0,
+    cache_write: int = 0,
+) -> float:
-    return (
-        input_tokens * rates["input"] / 1_000_000
-        + output_tokens * rates["output"] / 1_000_000
-    )
+    non_cache = max(0, input_tokens - cache_read - cache_write)
+    return (
+        non_cache * rates["input"] / 1_000_000
+        + output_tokens * rates["output"] / 1_000_000
+        + cache_read * rates["cache_read"] / 1_000_000
+        + cache_write * rates["cache_write"] / 1_000_000
+    )
-    result = {"input_tokens": 0, "output_tokens": 0, "tokens": 0, "cost_usd": 0.0}
+    result = {
+        "input_tokens": 0,
+        "output_tokens": 0,
+        "tokens": 0,
+        "cache_read_tokens": 0,
+        "cache_write_tokens": 0,
+        "cost_usd": 0.0,
+    }
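
As a worked example of the formula in the diff above, here is a standalone sketch. The rates in `EXAMPLE_RATES` are made-up placeholders, not values from `MODEL_PRICING`, and `compute_cost` takes the rate table as an argument purely so the example is self-contained. Like the diff, it assumes `input_tokens` includes the cache tokens, which is why they are subtracted before applying the base input rate; that assumption is worth verifying against real CLI output.

```python
# Hypothetical per-million-token rates, for illustration only.
EXAMPLE_RATES = {
    "input": 3.00,
    "output": 15.00,
    "cache_read": 0.30,
    "cache_write": 3.75,
}


def compute_cost(
    input_tokens: int,
    output_tokens: int,
    rates: dict,
    cache_read: int = 0,
    cache_write: int = 0,
) -> float:
    """Cache-aware cost in USD, mirroring the formula proposed in the diff."""
    # Assumption: input_tokens is inclusive of cache tokens; clamp at zero
    # in case the reported fields are inconsistent.
    non_cache = max(0, input_tokens - cache_read - cache_write)
    return (
        non_cache * rates["input"]
        + output_tokens * rates["output"]
        + cache_read * rates["cache_read"]
        + cache_write * rates["cache_write"]
    ) / 1_000_000


# 1M input tokens, of which 800k are cache reads and 100k are cache writes.
cached = compute_cost(1_000_000, 100_000, EXAMPLE_RATES,
                      cache_read=800_000, cache_write=100_000)
# The same token counts priced without cache tiers (the current raw-mode behavior).
uncached = compute_cost(1_000_000, 100_000, EXAMPLE_RATES)
print(round(cached, 3), round(uncached, 3))  # 2.415 4.5
```

With these placeholder rates, ignoring cache tiers overstates the cost by almost a factor of two.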

Documentation changes:

  • python/agentize/eval/eval_harness.md — update Cost Estimation and Raw mode cost tracking behavior.
  • python/agentize/usage.md — update count_usage behavior section to mention dedup and cache tiers.
  • docs/cli/lol.md — update lol usage section to mention dedup and cache-tier cost estimate.
  • python/tests/test_eval_harness.md — new test doc (follows document-guideline).
  • tests/cli/test-lol-usage.md — new test doc (follows document-guideline).

Documentation Planning

High-level design docs (docs/)

  • docs/cli/lol.md — update lol usage description to mention session-level dedup by message.id and cache-tier cost calculation.
- Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket.
+ Parses JSONL files from `~/.claude/projects/**/*.jsonl` to extract and aggregate token usage statistics by time bucket.
+ Assistant entries with the same `message.id` within a session are deduplicated to avoid double-counting streamed content blocks.
+ Cost estimates use cache-tier pricing when cache token fields are present.

Folder READMEs

  • None.

Interface docs

  • python/agentize/eval/eval_harness.md — clarify raw-mode cache-tier cost and JSONL dedup.
- | `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage |
+ | `raw` | `claude -p` + bare bug report | The model alone (baseline) | Claude JSON usage (cache-tier aware when provided) |
- Cost is tracked via JSONL session file diffing — the same approach used
+ Cost is tracked via JSONL session file diffing — the same approach used,
+ with per-session deduplication by `message.id` to avoid streaming duplication.
  • python/agentize/usage.md — update count_usage behavior to mention dedup and cache tiers.
- - Extracts `input_tokens` and `output_tokens` from assistant messages
+ - Extracts `input_tokens` and `output_tokens` from assistant messages
+ - Deduplicates assistant entries that share the same `message.id` within a session file
+ - Cost estimation applies cache_read/cache_write tiers when present
  • python/tests/test_eval_harness.md — new doc summarizing test intent and fixtures (follow document-guideline).
+ # test_eval_harness.py
+ 
+ Design rationale: validate cache-tier cost and JSONL dedup logic for eval cost tracking.
+ 
+ Scope:
+ - _compute_cost cache-aware pricing
+ - _parse_claude_usage cache token extraction
+ - _sum_jsonl_usage dedup by message.id
  • tests/cli/test-lol-usage.md — new doc describing CLI usage fixtures and expected totals (follow document-guideline).
+ # test-lol-usage.sh
+ 
+ Design rationale: ensure CLI usage aggregation deduplicates repeated assistant entries.
+ 
+ Fixtures:
+ - JSONL with duplicate message.id entries
+ - Expected totals reflect single-counted usage

Test Strategy

Test modifications:

  • python/tests/test_eval_harness.py — add tests for _compute_cost cache tiers, _parse_claude_usage cache fields, _sum_jsonl_usage dedup by message.id, and the non-dedup fallback when message.id is missing.
  • tests/cli/test-lol-usage.sh — add a fixture with duplicate message.id and assert the Total: line shows single-counted input/output.

Test data required:

  • Temporary JSONL fixtures with duplicate message.id and cache token fields.
  • Use the existing temp HOME pattern in CLI tests.
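
A fixture of the kind listed above might be generated as in the following sketch. The helper names `write_duplicate_fixture` and `single_counted_total` are hypothetical, and the usage field names `cache_read_input_tokens` and `cache_creation_input_tokens` are assumptions about the session JSONL schema; adjust them to whatever the real files contain.

```python
import json
import tempfile
from pathlib import Path


def write_duplicate_fixture(directory: Path) -> Path:
    """Write a session JSONL in which msg_A appears twice and msg_B once."""
    usage = {
        "input_tokens": 100,
        "output_tokens": 20,
        "cache_read_input_tokens": 80,      # assumed field name
        "cache_creation_input_tokens": 10,  # assumed field name
    }
    entries = [
        {"type": "assistant", "message": {"id": "msg_A", "usage": usage}},
        {"type": "assistant", "message": {"id": "msg_A", "usage": usage}},
        {"type": "assistant", "message": {"id": "msg_B", "usage": usage}},
    ]
    path = directory / "session.jsonl"
    path.write_text("\n".join(json.dumps(e) for e in entries) + "\n", encoding="utf-8")
    return path


def single_counted_total(path: Path, field: str) -> int:
    """Expected total for one usage field when each message.id counts once."""
    seen = set()
    total = 0
    for line in path.read_text(encoding="utf-8").splitlines():
        message = json.loads(line)["message"]
        if message["id"] in seen:
            continue
        seen.add(message["id"])
        total += message["usage"][field]
    return total


with tempfile.TemporaryDirectory() as tmp:
    fixture = write_duplicate_fixture(Path(tmp))
    expected_input = single_counted_total(fixture, "input_tokens")
    expected_output = single_counted_total(fixture, "output_tokens")
```

A test would then assert that the code under test reports `expected_input` and `expected_output` rather than the triple-counted sums.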

Implementation Steps

Step 1: Update documentation for cost dedup and cache tiers (Estimated: 80 LOC)
Files:

  • docs/cli/lol.md — update lol usage section wording to mention dedup and cache-tier cost.
  • python/agentize/eval/eval_harness.md — clarify raw-mode cache-tier cost and JSONL dedup behavior.
  • python/agentize/usage.md — document dedup behavior and cache-tier pricing in count_usage.
  • python/tests/test_eval_harness.md — create new test doc.
  • tests/cli/test-lol-usage.md — create new test doc.
    Dependencies: None
    Correspondence:
    Docs: Defines dedup semantics and cache-tier cost usage.
    Tests: Establishes expected behavior for new test cases.

Step 2: Add tests for dedup and cache-aware cost (Estimated: 90 LOC)
Files:

  • python/tests/test_eval_harness.py — add cache-tier cost tests and JSONL dedup tests with message.id.
  • tests/cli/test-lol-usage.sh — add duplicate message.id fixture and assert totals.
    Dependencies: Step 1
    Correspondence:
    Docs: Matches updated dedup/cost descriptions.
    Tests: Adds regression coverage for cost and dedup behavior.

Step 3: Implement cache-aware raw cost and JSONL dedup (Estimated: 90 LOC)
Files:

  • python/agentize/eval/eval_harness.py — update _compute_cost to accept cache tiers, update _parse_claude_usage to read cache fields, add per-file dedup by message.id in _sum_jsonl_usage.
  • python/agentize/usage.py — add per-file dedup by message.id in count_usage parsing loop.
    Dependencies: Step 2
    Correspondence:
    Docs: Implements behaviors documented in python/agentize/eval/eval_harness.md, python/agentize/usage.md, docs/cli/lol.md.
    Tests: Satisfies new cache-aware and dedup test cases.

Total estimated complexity: 260 LOC (Large)
Recommended approach: Single session

Success Criteria

  • Raw mode cost uses cache tiers when cache token fields are present in JSON output.
  • JSONL parsing deduplicates assistant entries by message.id within a session file.
  • Tests for dedup and cache-aware cost pass.

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| message.id missing in some JSONL entries | M | M | Dedup only when message.id is present; document fallback behavior. |
| Raw JSON output lacks cache fields | M | L | Default cache values to 0; document that cache-tier cost applies only when fields exist. |
| Dedup undercounts if IDs repeat across distinct messages | L | M | Scope dedup per file and only for assistant entries; add test coverage for non-id lines. |
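
The "default cache values to 0" mitigation could be sketched as follows. This is not the existing `_parse_claude_usage`: the function name `parse_usage` is hypothetical, and both the top-level `usage` object and the field names `cache_read_input_tokens` and `cache_creation_input_tokens` are assumptions about the CLI's JSON output. The result keys follow the dict proposed under Interface Design.

```python
import json


def parse_usage(raw_json: str) -> dict:
    """Extract token counts from raw JSON output, defaulting absent fields to 0."""
    result = {
        "input_tokens": 0,
        "output_tokens": 0,
        "cache_read_tokens": 0,
        "cache_write_tokens": 0,
    }
    try:
        data = json.loads(raw_json)
    except (json.JSONDecodeError, TypeError):
        return result  # unparseable output: report zeros rather than raise
    usage = data.get("usage") if isinstance(data, dict) else None
    if not isinstance(usage, dict):
        return result
    # Assumed source field names; a missing or null field stays at 0.
    field_map = {
        "input_tokens": "input_tokens",
        "output_tokens": "output_tokens",
        "cache_read_tokens": "cache_read_input_tokens",
        "cache_write_tokens": "cache_creation_input_tokens",
    }
    for target, source in field_map.items():
        result[target] = int(usage.get(source) or 0)
    return result
```

When the cache fields are absent, both cache counts stay at 0 and a cache-aware cost formula reduces to the plain input/output calculation.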

Dependencies

  • No new dependencies; avoid Anthropic tokenizer integration in this fix.

Labels: agentize:plan (Plan created by /ultra-planner command)
