
perf: add per-file cache for 14x speedup on large datasets#922

Open
LEON-gittech wants to merge 1 commit into ryoppippi:main from LEON-gittech:feat/file-level-cache

Conversation


@LEON-gittech LEON-gittech commented Mar 31, 2026

Summary

  • Add a file-level cache (~/.cache/ccusage/data-cache-v1.json) that stores parsed JSONL entries keyed by mtime + size, so unchanged session files are never re-read from disk
  • Add in-memory stat cache to deduplicate stat() syscalls within a single run
  • Add date pre-filtering: --since/--until flags now skip files outside the date range using cached earliest/latest timestamps, dramatically reducing the number of files processed
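The mtime + size keying described above can be sketched as follows. This is an illustrative reduction, not the PR's exact API: the type shape and the `isCacheEntryFresh` helper are hypothetical stand-ins for what lives in `_file-cache.ts`.

```typescript
// Illustrative sketch of the mtime+size validity check; names are hypothetical.
type CachedFileData = {
	mt: number; // mtimeMs captured when the file was last parsed
	sz: number; // byte size captured at the same time
};

type FileStat = { mtimeMs: number; size: number };

// A cached entry is reused only when both mtime and size still match exactly;
// any change to the file invalidates it and forces a re-parse.
function isCacheEntryFresh(cached: CachedFileData | undefined, stat: FileStat): boolean {
	if (cached == null) { return false; }
	return cached.mt === stat.mtimeMs && cached.sz === stat.size;
}
```

Checking both fields guards against the edge case where a file is rewritten to the same size (caught by mtime) or touched without content change on coarse-mtime filesystems (partially caught by size).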

Motivation

With heavy Claude Code usage (8700+ session files, 3.3GB), ccusage takes 74 seconds because it re-reads and re-parses every JSONL file on every invocation — even files that haven't changed since the last run.

Benchmarks

Tested with 8714 session files (177k entries, 3.3GB total):

| Scenario | Time | Speedup |
| --- | --- | --- |
| Original ccusage v18 | 74s | baseline |
| Cached (cold, building cache) | 48s | 1.5x |
| Cached (warm, no date filter) | 18s | 4x |
| Cached (warm) + `--since` | 5s | 14x |

Implementation

New file: _file-cache.ts

  • Per-file cache with mtime + size validation
  • Compact entry format (shortened keys) to minimize cache size (~43MB for 8714 files)
  • In-memory stat cache to avoid redundant syscalls
  • mightContainEntriesInRange() for O(1) date pre-filtering per file
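The in-memory stat cache can be sketched as a simple per-run memo. The stat function is injected here so the memoization is testable without touching the file system; the real `statCached()` presumably wraps `fs.stat` directly.

```typescript
// Hypothetical sketch of per-run stat() deduplication. The real statCached()
// in _file-cache.ts presumably wraps fs.stat; it is injected here for testing.
type FileStat = { mtimeMs: number; size: number };

function makeStatCached(statFn: (path: string) => FileStat): (path: string) => FileStat {
	const memo = new Map<string, FileStat>();
	return (path: string): FileStat => {
		const hit = memo.get(path);
		if (hit != null) { return hit; } // repeat lookups in the same run are free
		const result = statFn(path);
		memo.set(path, result);
		return result;
	};
}
```

Since the memo lives only for one process invocation, a file changing mid-run is the only staleness risk, which is acceptable for a short-lived CLI.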

Modified: data-loader.ts

  • getEarliestTimestamp() — cache hit returns instantly, cache miss populates full cache
  • loadDailyUsageData() — uses cached entries + date pre-filter
  • loadSessionData() — uses cached entries + date pre-filter
  • loadSessionBlockData() — uses cached entries + date pre-filter
  • loadSessionUsageById() — uses cached entries
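The loaders listed above share one cache-first read path. A minimal sketch of that flow, with all collaborators injected as hypothetical stand-ins (the real `getFileEntries()` talks to the cache module directly):

```typescript
// Sketch of the cache-first read path shared by the loaders above.
// readCached/parseFile/storeCached are illustrative stand-ins.
type Entry = { timestamp: string };

async function getFileEntries(
	filePath: string,
	readCached: (p: string) => Promise<Entry[] | null>,
	parseFile: (p: string) => Promise<Entry[]>,
	storeCached: (p: string, entries: Entry[]) => Promise<void>,
): Promise<Entry[]> {
	const cached = await readCached(filePath);
	if (cached != null) { return cached; } // hit: no disk read, JSON.parse, or validation
	const entries = await parseFile(filePath); // miss: parse the JSONL once...
	await storeCached(filePath, entries); // ...and persist for the next run
	return entries;
}
```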

Test plan

  • All 354 existing tests pass unchanged
  • TypeScript typecheck passes
  • Verified output matches original ccusage (token counts and costs are consistent)
  • Cache correctly invalidated when files change (mtime/size check)
  • Cache automatically pruned for deleted files
  • Works correctly with --offline, --since, --until, --project flags

Summary by CodeRabbit

  • New Features
    • Introduced automatic file-level caching for usage data to significantly improve performance by avoiding unnecessary re-reads and re-parsing of unchanged files.
    • Added intelligent date-range pre-filtering to skip unrelated files during data loading operations, reducing processing time.
    • Cache automatically validates freshness based on file modification times to ensure data accuracy while providing persistent performance benefits across runs.

Add a file-level cache that stores parsed JSONL entries keyed by
mtime+size. On subsequent runs, unchanged files are never re-read
from disk, eliminating redundant I/O, JSON.parse, and schema
validation.

Key optimizations:
- Per-file cache in ~/.cache/ccusage/data-cache-v1.json
- In-memory stat cache to avoid redundant stat() syscalls per run
- Date pre-filtering: --since/--until skip files outside the date
  range using cached earliest/latest timestamps, reducing processed
  files from thousands to only the active ones

Benchmarks (8700+ session files, 3.3GB total):
- Original:           74s
- Cached (no filter):  18s  (4x)
- Cached + --since:     5s  (14x)

All 354 existing tests pass unchanged.

Co-Authored-By: claude-flow <ruv@ruv.net>

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

A new persistent file cache module is introduced to store compact parsed entry metadata, enabling the ccusage data loader to skip re-reading unchanged files. The cache includes versioning, validation by file modification time and size, lazy loading with fallback handling, date-range filtering, and lifecycle management via pruning and save operations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **New Cache Module**<br>`apps/ccusage/src/_file-cache.ts` | Introduces a persistent JSONL cache with types (`CompactEntry`, `CachedFileData`), core functions (`getCache()`, `getCachedFileData()`, `setCachedFileData()`, `statCached()`), lifecycle management (`saveCache()`, `pruneCache()`, `resetCache()`), and query filtering (`mightContainEntriesInRange()`) using versioned on-disk storage with in-memory state tracking and per-run stat caching. |
| **Data Loader Integration**<br>`apps/ccusage/src/data-loader.ts` | Refactors data loaders to use the cache system via a new `getFileEntries()` function with compact encoding/decoding helpers (`toCompactEntry()`, `fromCompactEntry()`). Replaces per-line JSONL parsing with cached validation and stream processing. Updates `loadDailyUsageData()`, `loadSessionData()`, `loadSessionUsageById()`, and `loadSessionBlockData()` to pre-filter files using cached date-range metadata, prune stale cache entries, and persist the cache after processing. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Loader as Data Loader
    participant Cache as Cache Module
    participant FS as File System
    participant Parser as JSONL Parser

    Loader->>Cache: mightContainEntriesInRange(file, sinceDate, untilDate)
    Cache->>Cache: Check cached earliest/latest timestamps
    Cache-->>Loader: true/false (skip if out of range)

    Loader->>Cache: getCachedFileData(filePath)
    Cache->>Cache: Load cache store (lazy init)
    Cache->>FS: stat(file)
    Cache->>Cache: Validate mtimeMs & size match
    alt Cache Hit
        Cache-->>Loader: CachedFileData with compact entries
        Loader->>Loader: Decode compact entries to UsageData
    else Cache Miss
        Cache-->>Loader: null
        Loader->>FS: Read JSONL file stream
        Loader->>Parser: Parse & validate each line
        Parser-->>Loader: UsageData objects
        Loader->>Loader: Encode to CompactEntry, track et/lt
        Loader->>Cache: setCachedFileData(filePath, {entries, et, lt, ...})
        Loader->>Cache: saveCache()
        Cache->>FS: Write cache to XDG_CACHE_HOME
    end

    Loader->>Cache: pruneCache(existingFiles)
    Cache->>Cache: Remove entries for missing files
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A cache so swift, a hare's delight,
No file re-read, just metadata's might,
Timestamps cached, the entries compressed,
JSONL parsed once, then put to rest,
Swift filtering by date, and lo—
Unused data we needn't know!

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'perf: add per-file cache for 14x speedup on large datasets' directly and clearly describes the main change: adding a per-file cache to improve performance. It aligns with the primary objective of the changeset and is specific enough to convey the core improvement. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/ccusage/src/_file-cache.ts`:
- Around line 176-199: The function mightContainEntriesInRange currently
short-circuits based solely on cached lt/et and can skip files whose session was
appended to since the cache was written; update it to first validate the cached
file entry (via getCachedFileData or an mtime/size check against
cache.files[filePath]) and force a re-check when the file has changed, only
using cached lt/et to skip when the cached entry is confirmed up-to-date; refer
to mightContainEntriesInRange, getCache(), cache.files[filePath],
getCachedFileData (or fs.stat-based validation), and cached.lt/cached.et to
implement this ordering so stale metadata cannot bypass mtime/size validation.

In `@apps/ccusage/src/data-loader.ts`:
- Around line 907-914: The prefiltering loop using mightContainEntriesInRange
creates datePreFilteredFiles and then builds processedHashes only from those,
which scopes deduplication to the date window and allows older out-of-window
duplicates to bypass suppression; instead, compute processedHashes from the
broader set that includes projectFilteredFiles (or from cached metadata
regardless of date) before applying datePreFilteredFiles so deduplication sees
all known hashes; update the same pattern wherever datePreFilteredFiles and
processedHashes are used (the other two occurrences) to build processedHashes
from the full projectFilteredFiles/cached metadata first, then apply
mightContainEntriesInRange to produce datePreFilteredFiles for I/O filtering.
- Around line 577-628: The compacting drops message.content which prevents
getUsageLimitResetTime() from finding reset timestamps when fromCompactEntry()
reconstructs entries; update toCompactEntry() to include the minimal reset-time
payload (e.g., add a short field like entry.mr =
data.message.content?.reset_time or entry.mc = data.message.content when
present) and update fromCompactEntry() to restore that value into
message.content or message.content.reset_time so
loadSessionBlockData()/getUsageLimitResetTime() can recover reset timestamps;
change only toCompactEntry and fromCompactEntry (or alternatively stop using the
compact path for block loading) so other logic (sortFilesByTimestamp,
getEarliestTimestamp) keeps working.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3c7c23df-6f1a-4644-a6b4-8f34c500e6ab

📥 Commits

Reviewing files that changed from the base of the PR and between 61ee04d and a8346d5.

📒 Files selected for processing (2)
  • apps/ccusage/src/_file-cache.ts
  • apps/ccusage/src/data-loader.ts

Comment on lines +176 to +199
```typescript
export async function mightContainEntriesInRange(
	filePath: string,
	sinceDate: string | undefined,
	untilDate: string | undefined,
): Promise<boolean> {
	if (sinceDate == null && untilDate == null) {return true;}

	const cache = await getCache();
	const cached = cache.files[filePath];
	if (cached == null) {return true;} // Unknown file, must process

	// Convert ISO timestamps to YYYYMMDD for comparison
	const earliest = cached.et?.slice(0, 10).replace(/-/g, '') ?? null;
	const latest = cached.lt?.slice(0, 10).replace(/-/g, '') ?? null;

	if (earliest == null || latest == null) {return true;} // No timestamp info, must process

	// If file's latest entry is before since date, skip it
	if (sinceDate != null && latest < sinceDate) {return false;}

	// If file's earliest entry is after until date, skip it
	if (untilDate != null && earliest > untilDate) {return false;}

	return true;
}
```

⚠️ Potential issue | 🔴 Critical

Don't let stale range metadata short-circuit file processing.

This helper can return false before any mtime/size validation happens. If a session file is appended to between runs, stale lt/et values can make --since or --until skip it entirely, and getCachedFileData() never gets a chance to invalidate the entry.

🔧 Minimal fix

```diff
 export async function mightContainEntriesInRange(
 	filePath: string,
 	sinceDate: string | undefined,
 	untilDate: string | undefined,
 ): Promise<boolean> {
 	if (sinceDate == null && untilDate == null) {return true;}

-	const cache = await getCache();
-	const cached = cache.files[filePath];
-	if (cached == null) {return true;} // Unknown file, must process
+	const cached = await getCachedFileData(filePath);
+	if (cached == null) {return true;} // Unknown or stale file, must process

 	// Convert ISO timestamps to YYYYMMDD for comparison
 	const earliest = cached.et?.slice(0, 10).replace(/-/g, '') ?? null;
 	const latest = cached.lt?.slice(0, 10).replace(/-/g, '') ?? null;
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 export async function mightContainEntriesInRange(
 	filePath: string,
 	sinceDate: string | undefined,
 	untilDate: string | undefined,
 ): Promise<boolean> {
 	if (sinceDate == null && untilDate == null) {return true;}
-	const cache = await getCache();
-	const cached = cache.files[filePath];
-	if (cached == null) {return true;} // Unknown file, must process
+	const cached = await getCachedFileData(filePath);
+	if (cached == null) {return true;} // Unknown or stale file, must process
 	// Convert ISO timestamps to YYYYMMDD for comparison
 	const earliest = cached.et?.slice(0, 10).replace(/-/g, '') ?? null;
 	const latest = cached.lt?.slice(0, 10).replace(/-/g, '') ?? null;
 	if (earliest == null || latest == null) {return true;} // No timestamp info, must process
 	// If file's latest entry is before since date, skip it
 	if (sinceDate != null && latest < sinceDate) {return false;}
 	// If file's earliest entry is after until date, skip it
 	if (untilDate != null && earliest > untilDate) {return false;}
 	return true;
 }
```

Comment on lines +577 to +628
```typescript
/**
 * Convert a UsageData entry to compact cache representation.
 * Strips content field and shortens keys to minimize cache size.
 */
function toCompactEntry(data: UsageData): CompactEntry {
	const entry: CompactEntry = {
		t: data.timestamp,
		u: [
			data.message.usage.input_tokens,
			data.message.usage.output_tokens,
			data.message.usage.cache_creation_input_tokens ?? 0,
			data.message.usage.cache_read_input_tokens ?? 0,
		],
	};
	if (data.message.model != null) {entry.m = data.message.model;}
	if (data.message.id != null) {entry.mi = data.message.id;}
	if (data.requestId != null) {entry.ri = data.requestId;}
	if (data.sessionId != null) {entry.si = data.sessionId;}
	if (data.message.usage.speed === 'fast') {entry.sp = 'fast';}
	if (data.costUSD != null) {entry.c = data.costUSD;}
	if (data.version != null) {entry.v = data.version;}
	if (data.isApiErrorMessage === true) {entry.ae = true;}
	if (data.cwd != null) {entry.cwd = data.cwd;}
	return entry;
}

/**
 * Reconstruct a UsageData-compatible object from a compact cache entry.
 * The returned object satisfies the UsageData interface without valibot validation.
 */
function fromCompactEntry(e: CompactEntry): UsageData {
	return {
		timestamp: e.t,
		message: {
			usage: {
				input_tokens: e.u[0],
				output_tokens: e.u[1],
				cache_creation_input_tokens: e.u[2] || undefined,
				cache_read_input_tokens: e.u[3] || undefined,
				speed: e.sp as 'fast' | undefined,
			},
			model: e.m,
			id: e.mi,
		},
		costUSD: e.c,
		requestId: e.ri,
		version: e.v,
		sessionId: e.si,
		isApiErrorMessage: e.ae,
		cwd: e.cwd,
	} as UsageData;
}
```

⚠️ Potential issue | 🟠 Major

This compact representation breaks block loading once the cache is consulted.

sortFilesByTimestamp() now primes the cache via getEarliestTimestamp(), so loadSessionBlockData() rereads those entries through fromCompactEntry(). Because message.content is dropped here, getUsageLimitResetTime() can no longer recover reset timestamps from API-error messages. Either keep the minimal reset-time payload in the compact entry or leave block loading on the raw JSONL path.


Comment on lines +907 to +914
```typescript
// Pre-filter files by date range using cached metadata (no I/O needed).
// Files whose latest entry is before `since` or earliest is after `until` are skipped.
const datePreFilteredFiles: string[] = [];
for (const file of projectFilteredFiles) {
	if (await mightContainEntriesInRange(file, options?.since, options?.until)) {
		datePreFilteredFiles.push(file);
	}
}
```

⚠️ Potential issue | 🟠 Major

File-level prefiltering makes global dedup date-scoped once metadata exists.

After the cache has et/lt, processedHashes is built only from datePreFilteredFiles. An older duplicate outside --since/--until no longer suppresses a newer in-range copy, so totals change at the window boundary. The same pattern is repeated on Lines 1052-1058 and Lines 1493-1499.
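One possible shape of the fix the comment asks for, sketched with an assumed `cachedHashesFor` helper that reads dedup hashes from cached metadata without file I/O (the real loaders track hashes differently, so this is illustrative only):

```typescript
// Sketch: out-of-window files still seed the dedup set from cached metadata;
// only in-window files are kept for disk I/O. `cachedHashesFor` is hypothetical.
async function partitionForDedup(
	projectFilteredFiles: string[],
	inRange: (file: string) => Promise<boolean>,
	cachedHashesFor: (file: string) => string[],
): Promise<{ processedHashes: Set<string>; datePreFilteredFiles: string[] }> {
	const processedHashes = new Set<string>();
	const datePreFilteredFiles: string[] = [];
	for (const file of projectFilteredFiles) {
		if (await inRange(file)) {
			datePreFilteredFiles.push(file); // will be read and parsed normally
		}
		else {
			// Skipped for I/O, but its hashes still suppress in-window duplicates.
			for (const h of cachedHashesFor(file)) { processedHashes.add(h); }
		}
	}
	return { processedHashes, datePreFilteredFiles };
}
```

This keeps the I/O savings of the prefilter while restoring the global scope of deduplication, so totals no longer shift at the window boundary.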


LEON-gittech added a commit to LEON-gittech/ccusage that referenced this pull request Apr 1, 2026
- PR ryoppippi#922 benchmark data (74s → 5s, 14x speedup)
- Real-world optimization results (50% token reduction)
- Fresh install / replace official / switch back instructions
- Link to companion claude-context-optimizer plugin

Co-Authored-By: Claude <noreply@anthropic.com>
