
perf: add per-file cache for 14x speedup on large datasets#922

Open
LEON-gittech wants to merge 1 commit into ryoppippi:main from LEON-gittech:feat/file-level-cache

Conversation


@LEON-gittech LEON-gittech commented Mar 31, 2026

Summary

  • Add a file-level cache (~/.cache/ccusage/data-cache-v1.json) that stores parsed JSONL entries keyed by mtime + size, so unchanged session files are never re-read from disk
  • Add in-memory stat cache to deduplicate stat() syscalls within a single run
  • Add date pre-filtering: --since/--until flags now skip files outside the date range using cached earliest/latest timestamps, dramatically reducing the number of files processed
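The mtime + size keying described above can be sketched as follows. This is an illustrative reduction, not the PR's exact API: the type shape and the `isCacheEntryFresh` helper are hypothetical stand-ins for what lives in `_file-cache.ts`.

```typescript
// Illustrative sketch of the mtime+size validity check; names are hypothetical.
type CachedFileData = {
	mt: number; // mtimeMs captured when the file was last parsed
	sz: number; // byte size captured at the same time
};

type FileStat = { mtimeMs: number; size: number };

// A cached entry is reused only when both mtime and size still match exactly;
// any change to the file invalidates it and forces a re-parse.
function isCacheEntryFresh(cached: CachedFileData | undefined, stat: FileStat): boolean {
	if (cached == null) { return false; }
	return cached.mt === stat.mtimeMs && cached.sz === stat.size;
}
```

Checking both fields guards against the edge case where a file is rewritten to the same size (caught by mtime) or touched without content change on coarse-mtime filesystems (partially caught by size).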

Motivation

With heavy Claude Code usage (8700+ session files, 3.3GB), ccusage takes 74 seconds because it re-reads and re-parses every JSONL file on every invocation — even files that haven't changed since the last run.

Benchmarks

Tested with 8714 session files (177k entries, 3.3GB total):

| Scenario | Time | Speedup |
| --- | --- | --- |
| Original ccusage v18 | 74s | baseline |
| Cached (cold, building cache) | 48s | 1.5x |
| Cached (warm, no date filter) | 18s | 4x |
| Cached (warm) + `--since` | 5s | 14x |

Implementation

New file: _file-cache.ts

  • Per-file cache with mtime + size validation
  • Compact entry format (shortened keys) to minimize cache size (~43MB for 8714 files)
  • In-memory stat cache to avoid redundant syscalls
  • mightContainEntriesInRange() for O(1) date pre-filtering per file
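The in-memory stat cache can be sketched as a simple per-run memo. The stat function is injected here so the memoization is testable without touching the file system; the real `statCached()` presumably wraps `fs.stat` directly.

```typescript
// Hypothetical sketch of per-run stat() deduplication. The real statCached()
// in _file-cache.ts presumably wraps fs.stat; it is injected here for testing.
type FileStat = { mtimeMs: number; size: number };

function makeStatCached(statFn: (path: string) => FileStat): (path: string) => FileStat {
	const memo = new Map<string, FileStat>();
	return (path: string): FileStat => {
		const hit = memo.get(path);
		if (hit != null) { return hit; } // repeat lookups in the same run are free
		const result = statFn(path);
		memo.set(path, result);
		return result;
	};
}
```

Since the memo lives only for one process invocation, a file changing mid-run is the only staleness risk, which is acceptable for a short-lived CLI.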

Modified: data-loader.ts

  • getEarliestTimestamp() — cache hit returns instantly, cache miss populates full cache
  • loadDailyUsageData() — uses cached entries + date pre-filter
  • loadSessionData() — uses cached entries + date pre-filter
  • loadSessionBlockData() — uses cached entries + date pre-filter
  • loadSessionUsageById() — uses cached entries
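The loaders listed above share one cache-first read path. A minimal sketch of that flow, with all collaborators injected as hypothetical stand-ins (the real `getFileEntries()` talks to the cache module directly):

```typescript
// Sketch of the cache-first read path shared by the loaders above.
// readCached/parseFile/storeCached are illustrative stand-ins.
type Entry = { timestamp: string };

async function getFileEntries(
	filePath: string,
	readCached: (p: string) => Promise<Entry[] | null>,
	parseFile: (p: string) => Promise<Entry[]>,
	storeCached: (p: string, entries: Entry[]) => Promise<void>,
): Promise<Entry[]> {
	const cached = await readCached(filePath);
	if (cached != null) { return cached; } // hit: no disk read, JSON.parse, or validation
	const entries = await parseFile(filePath); // miss: parse the JSONL once...
	await storeCached(filePath, entries); // ...and persist for the next run
	return entries;
}
```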

Test plan

  • All 354 existing tests pass unchanged
  • TypeScript typecheck passes
  • Verified output matches original ccusage (token counts and costs are consistent)
  • Cache correctly invalidated when files change (mtime/size check)
  • Cache automatically pruned for deleted files
  • Works correctly with --offline, --since, --until, --project flags

Summary by CodeRabbit

  • New Features
    • Introduced automatic file-level caching for usage data to significantly improve performance by avoiding unnecessary re-reads and re-parsing of unchanged files.
    • Added intelligent date-range pre-filtering to skip unrelated files during data loading operations, reducing processing time.
    • Cache automatically validates freshness based on file modification times to ensure data accuracy while providing persistent performance benefits across runs.

Add a file-level cache that stores parsed JSONL entries keyed by
mtime+size. On subsequent runs, unchanged files are never re-read
from disk, eliminating redundant I/O, JSON.parse, and schema
validation.

Key optimizations:
- Per-file cache in ~/.cache/ccusage/data-cache-v1.json
- In-memory stat cache to avoid redundant stat() syscalls per run
- Date pre-filtering: --since/--until skip files outside the date
  range using cached earliest/latest timestamps, reducing processed
  files from thousands to only the active ones

Benchmarks (8700+ session files, 3.3GB total):
- Original:           74s
- Cached (no filter):  18s  (4x)
- Cached + --since:     5s  (14x)

All 354 existing tests pass unchanged.

Co-Authored-By: claude-flow <ruv@ruv.net>

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

A new persistent file cache module is introduced to store compact parsed entry metadata, enabling the ccusage data loader to skip re-reading unchanged files. The cache includes versioning, validation by file modification time and size, lazy loading with fallback handling, date-range filtering, and lifecycle management via pruning and save operations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **New Cache Module**<br>`apps/ccusage/src/_file-cache.ts` | Introduces a persistent JSONL cache with types (`CompactEntry`, `CachedFileData`), core functions (`getCache()`, `getCachedFileData()`, `setCachedFileData()`, `statCached()`), lifecycle management (`saveCache()`, `pruneCache()`, `resetCache()`), and query filtering (`mightContainEntriesInRange()`) using versioned on-disk storage with in-memory state tracking and per-run stat caching. |
| **Data Loader Integration**<br>`apps/ccusage/src/data-loader.ts` | Refactors data loaders to use the cache system via a new `getFileEntries()` function with compact encoding/decoding helpers (`toCompactEntry()`, `fromCompactEntry()`). Replaces per-line JSONL parsing with cached validation and stream processing. Updates `loadDailyUsageData()`, `loadSessionData()`, `loadSessionUsageById()`, and `loadSessionBlockData()` to pre-filter files using cached date-range metadata, prune stale cache entries, and persist the cache after processing. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Loader as Data Loader
    participant Cache as Cache Module
    participant FS as File System
    participant Parser as JSONL Parser

    Loader->>Cache: mightContainEntriesInRange(file, sinceDate, untilDate)
    Cache->>Cache: Check cached earliest/latest timestamps
    Cache-->>Loader: true/false (skip if out of range)

    Loader->>Cache: getCachedFileData(filePath)
    Cache->>Cache: Load cache store (lazy init)
    Cache->>FS: stat(file)
    Cache->>Cache: Validate mtimeMs & size match
    alt Cache Hit
        Cache-->>Loader: CachedFileData with compact entries
        Loader->>Loader: Decode compact entries to UsageData
    else Cache Miss
        Cache-->>Loader: null
        Loader->>FS: Read JSONL file stream
        Loader->>Parser: Parse & validate each line
        Parser-->>Loader: UsageData objects
        Loader->>Loader: Encode to CompactEntry, track et/lt
        Loader->>Cache: setCachedFileData(filePath, {entries, et, lt, ...})
        Loader->>Cache: saveCache()
        Cache->>FS: Write cache to XDG_CACHE_HOME
    end

    Loader->>Cache: pruneCache(existingFiles)
    Cache->>Cache: Remove entries for missing files
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A cache so swift, a hare's delight,
No file re-read, just metadata's might,
Timestamps cached, the entries compressed,
JSONL parsed once, then put to rest,
Swift filtering by date, and lo—
Unused data we needn't know!

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'perf: add per-file cache for 14x speedup on large datasets' directly and clearly describes the main change: adding a per-file cache to improve performance. It aligns with the primary objective of the changeset and is specific enough to convey the core improvement. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/ccusage/src/_file-cache.ts`:
- Around line 176-199: The function mightContainEntriesInRange currently
short-circuits based solely on cached lt/et and can skip files whose session was
appended to since the cache was written; update it to first validate the cached
file entry (via getCachedFileData or an mtime/size check against
cache.files[filePath]) and force a re-check when the file has changed, only
using cached lt/et to skip when the cached entry is confirmed up-to-date; refer
to mightContainEntriesInRange, getCache(), cache.files[filePath],
getCachedFileData (or fs.stat-based validation), and cached.lt/cached.et to
implement this ordering so stale metadata cannot bypass mtime/size validation.

In `@apps/ccusage/src/data-loader.ts`:
- Around line 907-914: The prefiltering loop using mightContainEntriesInRange
creates datePreFilteredFiles and then builds processedHashes only from those,
which scopes deduplication to the date window and allows older out-of-window
duplicates to bypass suppression; instead, compute processedHashes from the
broader set that includes projectFilteredFiles (or from cached metadata
regardless of date) before applying datePreFilteredFiles so deduplication sees
all known hashes; update the same pattern wherever datePreFilteredFiles and
processedHashes are used (the other two occurrences) to build processedHashes
from the full projectFilteredFiles/cached metadata first, then apply
mightContainEntriesInRange to produce datePreFilteredFiles for I/O filtering.
- Around line 577-628: The compacting drops message.content which prevents
getUsageLimitResetTime() from finding reset timestamps when fromCompactEntry()
reconstructs entries; update toCompactEntry() to include the minimal reset-time
payload (e.g., add a short field like entry.mr =
data.message.content?.reset_time or entry.mc = data.message.content when
present) and update fromCompactEntry() to restore that value into
message.content or message.content.reset_time so
loadSessionBlockData()/getUsageLimitResetTime() can recover reset timestamps;
change only toCompactEntry and fromCompactEntry (or alternatively stop using the
compact path for block loading) so other logic (sortFilesByTimestamp,
getEarliestTimestamp) keeps working.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3c7c23df-6f1a-4644-a6b4-8f34c500e6ab

📥 Commits

Reviewing files that changed from the base of the PR and between 61ee04d and a8346d5.

📒 Files selected for processing (2)
  • apps/ccusage/src/_file-cache.ts
  • apps/ccusage/src/data-loader.ts

Comment on lines +176 to +199
```typescript
export async function mightContainEntriesInRange(
	filePath: string,
	sinceDate: string | undefined,
	untilDate: string | undefined,
): Promise<boolean> {
	if (sinceDate == null && untilDate == null) {return true;}

	const cache = await getCache();
	const cached = cache.files[filePath];
	if (cached == null) {return true;} // Unknown file, must process

	// Convert ISO timestamps to YYYYMMDD for comparison
	const earliest = cached.et?.slice(0, 10).replace(/-/g, '') ?? null;
	const latest = cached.lt?.slice(0, 10).replace(/-/g, '') ?? null;

	if (earliest == null || latest == null) {return true;} // No timestamp info, must process

	// If file's latest entry is before since date, skip it
	if (sinceDate != null && latest < sinceDate) {return false;}

	// If file's earliest entry is after until date, skip it
	if (untilDate != null && earliest > untilDate) {return false;}

	return true;
}
```

⚠️ Potential issue | 🔴 Critical

Don't let stale range metadata short-circuit file processing.

This helper can return false before any mtime/size validation happens. If a session file is appended to between runs, stale lt/et values can make --since or --until skip it entirely, and getCachedFileData() never gets a chance to invalidate the entry.

🔧 Minimal fix

```diff
 export async function mightContainEntriesInRange(
 	filePath: string,
 	sinceDate: string | undefined,
 	untilDate: string | undefined,
 ): Promise<boolean> {
 	if (sinceDate == null && untilDate == null) {return true;}

-	const cache = await getCache();
-	const cached = cache.files[filePath];
-	if (cached == null) {return true;} // Unknown file, must process
+	const cached = await getCachedFileData(filePath);
+	if (cached == null) {return true;} // Unknown or stale file, must process

 	// Convert ISO timestamps to YYYYMMDD for comparison
 	const earliest = cached.et?.slice(0, 10).replace(/-/g, '') ?? null;
 	const latest = cached.lt?.slice(0, 10).replace(/-/g, '') ?? null;
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 export async function mightContainEntriesInRange(
 	filePath: string,
 	sinceDate: string | undefined,
 	untilDate: string | undefined,
 ): Promise<boolean> {
 	if (sinceDate == null && untilDate == null) {return true;}
-	const cache = await getCache();
-	const cached = cache.files[filePath];
-	if (cached == null) {return true;} // Unknown file, must process
+	const cached = await getCachedFileData(filePath);
+	if (cached == null) {return true;} // Unknown or stale file, must process
 	// Convert ISO timestamps to YYYYMMDD for comparison
 	const earliest = cached.et?.slice(0, 10).replace(/-/g, '') ?? null;
 	const latest = cached.lt?.slice(0, 10).replace(/-/g, '') ?? null;
 	if (earliest == null || latest == null) {return true;} // No timestamp info, must process
 	// If file's latest entry is before since date, skip it
 	if (sinceDate != null && latest < sinceDate) {return false;}
 	// If file's earliest entry is after until date, skip it
 	if (untilDate != null && earliest > untilDate) {return false;}
 	return true;
 }
```

Comment on lines +577 to +628
```typescript
/**
 * Convert a UsageData entry to compact cache representation.
 * Strips content field and shortens keys to minimize cache size.
 */
function toCompactEntry(data: UsageData): CompactEntry {
	const entry: CompactEntry = {
		t: data.timestamp,
		u: [
			data.message.usage.input_tokens,
			data.message.usage.output_tokens,
			data.message.usage.cache_creation_input_tokens ?? 0,
			data.message.usage.cache_read_input_tokens ?? 0,
		],
	};
	if (data.message.model != null) {entry.m = data.message.model;}
	if (data.message.id != null) {entry.mi = data.message.id;}
	if (data.requestId != null) {entry.ri = data.requestId;}
	if (data.sessionId != null) {entry.si = data.sessionId;}
	if (data.message.usage.speed === 'fast') {entry.sp = 'fast';}
	if (data.costUSD != null) {entry.c = data.costUSD;}
	if (data.version != null) {entry.v = data.version;}
	if (data.isApiErrorMessage === true) {entry.ae = true;}
	if (data.cwd != null) {entry.cwd = data.cwd;}
	return entry;
}

/**
 * Reconstruct a UsageData-compatible object from a compact cache entry.
 * The returned object satisfies the UsageData interface without valibot validation.
 */
function fromCompactEntry(e: CompactEntry): UsageData {
	return {
		timestamp: e.t,
		message: {
			usage: {
				input_tokens: e.u[0],
				output_tokens: e.u[1],
				cache_creation_input_tokens: e.u[2] || undefined,
				cache_read_input_tokens: e.u[3] || undefined,
				speed: e.sp as 'fast' | undefined,
			},
			model: e.m,
			id: e.mi,
		},
		costUSD: e.c,
		requestId: e.ri,
		version: e.v,
		sessionId: e.si,
		isApiErrorMessage: e.ae,
		cwd: e.cwd,
	} as UsageData;
}
```

⚠️ Potential issue | 🟠 Major

This compact representation breaks block loading once the cache is consulted.

sortFilesByTimestamp() now primes the cache via getEarliestTimestamp(), so loadSessionBlockData() rereads those entries through fromCompactEntry(). Because message.content is dropped here, getUsageLimitResetTime() can no longer recover reset timestamps from API-error messages. Either keep the minimal reset-time payload in the compact entry or leave block loading on the raw JSONL path.


Comment on lines +907 to +914
```typescript
// Pre-filter files by date range using cached metadata (no I/O needed).
// Files whose latest entry is before `since` or earliest is after `until` are skipped.
const datePreFilteredFiles: string[] = [];
for (const file of projectFilteredFiles) {
	if (await mightContainEntriesInRange(file, options?.since, options?.until)) {
		datePreFilteredFiles.push(file);
	}
}
```

⚠️ Potential issue | 🟠 Major

File-level prefiltering makes global dedup date-scoped once metadata exists.

After the cache has et/lt, processedHashes is built only from datePreFilteredFiles. An older duplicate outside --since/--until no longer suppresses a newer in-range copy, so totals change at the window boundary. The same pattern is repeated on Lines 1052-1058 and Lines 1493-1499.
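One possible shape of the fix the comment asks for, sketched with an assumed `cachedHashesFor` helper that reads dedup hashes from cached metadata without file I/O (the real loaders track hashes differently, so this is illustrative only):

```typescript
// Sketch: out-of-window files still seed the dedup set from cached metadata;
// only in-window files are kept for disk I/O. `cachedHashesFor` is hypothetical.
async function partitionForDedup(
	projectFilteredFiles: string[],
	inRange: (file: string) => Promise<boolean>,
	cachedHashesFor: (file: string) => string[],
): Promise<{ processedHashes: Set<string>; datePreFilteredFiles: string[] }> {
	const processedHashes = new Set<string>();
	const datePreFilteredFiles: string[] = [];
	for (const file of projectFilteredFiles) {
		if (await inRange(file)) {
			datePreFilteredFiles.push(file); // will be read and parsed normally
		}
		else {
			// Skipped for I/O, but its hashes still suppress in-window duplicates.
			for (const h of cachedHashesFor(file)) { processedHashes.add(h); }
		}
	}
	return { processedHashes, datePreFilteredFiles };
}
```

This keeps the I/O savings of the prefilter while restoring the global scope of deduplication, so totals no longer shift at the window boundary.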


LEON-gittech added a commit to LEON-gittech/ccusage that referenced this pull request Apr 1, 2026
- PR ryoppippi#922 benchmark data (74s → 5s, 14x speedup)
- Real-world optimization results (50% token reduction)
- Fresh install / replace official / switch back instructions
- Link to companion claude-context-optimizer plugin

Co-Authored-By: Claude <noreply@anthropic.com>
