perf: add per-file cache for 14x speedup on large datasets #922

LEON-gittech wants to merge 1 commit into ryoppippi:main
Conversation
Add a file-level cache that stores parsed JSONL entries keyed by mtime+size. On subsequent runs, unchanged files are never re-read from disk, eliminating redundant I/O, JSON.parse, and schema validation.

Key optimizations:
- Per-file cache in ~/.cache/ccusage/data-cache-v1.json
- In-memory stat cache to avoid redundant stat() syscalls per run
- Date pre-filtering: --since/--until skip files outside the date range using cached earliest/latest timestamps, reducing processed files from thousands to only the active ones

Benchmarks (8700+ session files, 3.3GB total):
- Original: 74s
- Cached (no filter): 18s (4x)
- Cached + --since: 5s (14x)

All 354 existing tests pass unchanged.

Co-Authored-By: claude-flow <ruv@ruv.net>
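The mtime+size keying described above can be sketched as follows. This is a minimal illustration of the validation idea, not ccusage's actual API: the `CachedFile` type and `isCacheEntryFresh` name are assumptions, while `mtimeMs` and `size` mirror the fields Node's `fs.Stats` exposes.

```typescript
// Sketch of mtime+size cache validation (names are illustrative assumptions).
type CachedFile = { mtimeMs: number; size: number; entryCount: number };

// A cached entry is reusable only if the file's modification time AND byte
// size both match what was recorded when the cache was written. If either
// changed, the file must be re-read and re-parsed.
function isCacheEntryFresh(
	cached: CachedFile | undefined,
	stat: { mtimeMs: number; size: number },
): boolean {
	if (cached == null) { return false; } // never seen this file before
	return cached.mtimeMs === stat.mtimeMs && cached.size === stat.size;
}
```

Checking both fields guards against edits that preserve file size (mtime changes) as well as clock-skewed writes that preserve mtime (size changes).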
📝 Walkthrough

A new persistent file cache module is introduced to store compact parsed entry metadata, enabling the ccusage data loader to skip re-reading unchanged files. The cache includes versioning, validation by file modification time and size, lazy loading with fallback handling, date-range filtering, and lifecycle management via pruning and save operations.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Loader as Data Loader
    participant Cache as Cache Module
    participant FS as File System
    participant Parser as JSONL Parser
    Loader->>Cache: mightContainEntriesInRange(file, sinceDate, untilDate)
    Cache->>Cache: Check cached earliest/latest timestamps
    Cache-->>Loader: true/false (skip if out of range)
    Loader->>Cache: getCachedFileData(filePath)
    Cache->>Cache: Load cache store (lazy init)
    Cache->>FS: stat(file)
    Cache->>Cache: Validate mtimeMs & size match
    alt Cache Hit
        Cache-->>Loader: CachedFileData with compact entries
        Loader->>Loader: Decode compact entries to UsageData
    else Cache Miss
        Cache-->>Loader: null
        Loader->>FS: Read JSONL file stream
        Loader->>Parser: Parse & validate each line
        Parser-->>Loader: UsageData objects
        Loader->>Loader: Encode to CompactEntry, track et/lt
        Loader->>Cache: setCachedFileData(filePath, {entries, et, lt, ...})
        Loader->>Cache: saveCache()
        Cache->>FS: Write cache to XDG_CACHE_HOME
    end
    Loader->>Cache: pruneCache(existingFiles)
    Cache->>Cache: Remove entries for missing files
```
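The cache-miss branch tracks the earliest (`et`) and latest (`lt`) timestamps while parsing, which later powers the date pre-filter. A minimal sketch of that bookkeeping, relying on the fact that ISO-8601 strings in the same timezone compare correctly as plain strings (the helper name is an assumption):

```typescript
// Track earliest/latest ISO-8601 timestamps seen while parsing a file's entries.
function trackTimestampRange(timestamps: string[]): { et: string | null; lt: string | null } {
	let et: string | null = null;
	let lt: string | null = null;
	for (const t of timestamps) {
		// Lexicographic comparison is valid for same-offset ISO-8601 strings.
		if (et == null || t < et) { et = t; }
		if (lt == null || t > lt) { lt = t; }
	}
	return { et, lt };
}
```

Storing only these two strings per file is what makes the later range check O(1) per file with no disk reads.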
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/ccusage/src/_file-cache.ts`:
- Around line 176-199: The function mightContainEntriesInRange currently
short-circuits based solely on cached lt/et and can skip files whose session was
appended to since the cache was written; update it to first validate the cached
file entry (via getCachedFileData or an mtime/size check against
cache.files[filePath]) and force a re-check when the file has changed, only
using cached lt/et to skip when the cached entry is confirmed up-to-date; refer
to mightContainEntriesInRange, getCache(), cache.files[filePath],
getCachedFileData (or fs.stat-based validation), and cached.lt/cached.et to
implement this ordering so stale metadata cannot bypass mtime/size validation.
In `@apps/ccusage/src/data-loader.ts`:
- Around line 907-914: The prefiltering loop using mightContainEntriesInRange
creates datePreFilteredFiles and then builds processedHashes only from those,
which scopes deduplication to the date window and allows older out-of-window
duplicates to bypass suppression; instead, compute processedHashes from the
broader set that includes projectFilteredFiles (or from cached metadata
regardless of date) before applying datePreFilteredFiles so deduplication sees
all known hashes; update the same pattern wherever datePreFilteredFiles and
processedHashes are used (the other two occurrences) to build processedHashes
from the full projectFilteredFiles/cached metadata first, then apply
mightContainEntriesInRange to produce datePreFilteredFiles for I/O filtering.
- Around line 577-628: The compacting drops message.content which prevents
getUsageLimitResetTime() from finding reset timestamps when fromCompactEntry()
reconstructs entries; update toCompactEntry() to include the minimal reset-time
payload (e.g., add a short field like entry.mr =
data.message.content?.reset_time or entry.mc = data.message.content when
present) and update fromCompactEntry() to restore that value into
message.content or message.content.reset_time so
loadSessionBlockData()/getUsageLimitResetTime() can recover reset timestamps;
change only toCompactEntry and fromCompactEntry (or alternatively stop using the
compact path for block loading) so other logic (sortFilesByTimestamp,
getEarliestTimestamp) keeps working.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3c7c23df-6f1a-4644-a6b4-8f34c500e6ab
📒 Files selected for processing (2)
- apps/ccusage/src/_file-cache.ts
- apps/ccusage/src/data-loader.ts
```typescript
export async function mightContainEntriesInRange(
	filePath: string,
	sinceDate: string | undefined,
	untilDate: string | undefined,
): Promise<boolean> {
	if (sinceDate == null && untilDate == null) {return true;}

	const cache = await getCache();
	const cached = cache.files[filePath];
	if (cached == null) {return true;} // Unknown file, must process

	// Convert ISO timestamps to YYYYMMDD for comparison
	const earliest = cached.et?.slice(0, 10).replace(/-/g, '') ?? null;
	const latest = cached.lt?.slice(0, 10).replace(/-/g, '') ?? null;

	if (earliest == null || latest == null) {return true;} // No timestamp info, must process

	// If file's latest entry is before since date, skip it
	if (sinceDate != null && latest < sinceDate) {return false;}

	// If file's earliest entry is after until date, skip it
	if (untilDate != null && earliest > untilDate) {return false;}

	return true;
}
```
Don't let stale range metadata short-circuit file processing.
This helper can return false before any mtime/size validation happens. If a session file is appended between runs, stale lt/et values can make --since or --until skip it entirely and getCachedFileData() never gets a chance to invalidate the entry.
🔧 Minimal fix

```diff
 export async function mightContainEntriesInRange(
 	filePath: string,
 	sinceDate: string | undefined,
 	untilDate: string | undefined,
 ): Promise<boolean> {
 	if (sinceDate == null && untilDate == null) {return true;}
-	const cache = await getCache();
-	const cached = cache.files[filePath];
-	if (cached == null) {return true;} // Unknown file, must process
+	const cached = await getCachedFileData(filePath);
+	if (cached == null) {return true;} // Unknown or stale file, must process
 	// Convert ISO timestamps to YYYYMMDD for comparison
 	const earliest = cached.et?.slice(0, 10).replace(/-/g, '') ?? null;
 	const latest = cached.lt?.slice(0, 10).replace(/-/g, '') ?? null;
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
export async function mightContainEntriesInRange(
	filePath: string,
	sinceDate: string | undefined,
	untilDate: string | undefined,
): Promise<boolean> {
	if (sinceDate == null && untilDate == null) {return true;}
	const cached = await getCachedFileData(filePath);
	if (cached == null) {return true;} // Unknown or stale file, must process
	// Convert ISO timestamps to YYYYMMDD for comparison
	const earliest = cached.et?.slice(0, 10).replace(/-/g, '') ?? null;
	const latest = cached.lt?.slice(0, 10).replace(/-/g, '') ?? null;
	if (earliest == null || latest == null) {return true;} // No timestamp info, must process
	// If file's latest entry is before since date, skip it
	if (sinceDate != null && latest < sinceDate) {return false;}
	// If file's earliest entry is after until date, skip it
	if (untilDate != null && earliest > untilDate) {return false;}
	return true;
}
```
```typescript
/**
 * Convert a UsageData entry to compact cache representation.
 * Strips content field and shortens keys to minimize cache size.
 */
function toCompactEntry(data: UsageData): CompactEntry {
	const entry: CompactEntry = {
		t: data.timestamp,
		u: [
			data.message.usage.input_tokens,
			data.message.usage.output_tokens,
			data.message.usage.cache_creation_input_tokens ?? 0,
			data.message.usage.cache_read_input_tokens ?? 0,
		],
	};
	if (data.message.model != null) {entry.m = data.message.model;}
	if (data.message.id != null) {entry.mi = data.message.id;}
	if (data.requestId != null) {entry.ri = data.requestId;}
	if (data.sessionId != null) {entry.si = data.sessionId;}
	if (data.message.usage.speed === 'fast') {entry.sp = 'fast';}
	if (data.costUSD != null) {entry.c = data.costUSD;}
	if (data.version != null) {entry.v = data.version;}
	if (data.isApiErrorMessage === true) {entry.ae = true;}
	if (data.cwd != null) {entry.cwd = data.cwd;}
	return entry;
}

/**
 * Reconstruct a UsageData-compatible object from a compact cache entry.
 * The returned object satisfies the UsageData interface without valibot validation.
 */
function fromCompactEntry(e: CompactEntry): UsageData {
	return {
		timestamp: e.t,
		message: {
			usage: {
				input_tokens: e.u[0],
				output_tokens: e.u[1],
				cache_creation_input_tokens: e.u[2] || undefined,
				cache_read_input_tokens: e.u[3] || undefined,
				speed: e.sp as 'fast' | undefined,
			},
			model: e.m,
			id: e.mi,
		},
		costUSD: e.c,
		requestId: e.ri,
		version: e.v,
		sessionId: e.si,
		isApiErrorMessage: e.ae,
		cwd: e.cwd,
	} as UsageData;
}
```
This compact representation breaks block loading once the cache is consulted.
sortFilesByTimestamp() now primes the cache via getEarliestTimestamp(), so loadSessionBlockData() rereads those entries through fromCompactEntry(). Because message.content is dropped here, getUsageLimitResetTime() can no longer recover reset timestamps from API-error messages. Either keep the minimal reset-time payload in the compact entry or leave block loading on the raw JSONL path.
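One way the suggested fix could look, carrying a minimal reset-time payload through the compact round-trip. The field name `mr` and the simplified shapes below are assumptions for illustration, not the project's actual `CompactEntry`/`UsageData` schema:

```typescript
// Simplified shapes; the real types carry many more fields.
type MiniCompact = { t: string; mr?: string };
type MiniEntry = { timestamp: string; message: { content?: { reset_time?: string } } };

function toMiniCompact(e: MiniEntry): MiniCompact {
	const c: MiniCompact = { t: e.timestamp };
	const reset = e.message.content?.reset_time;
	// Keep only the one value block loading needs, not the whole content blob.
	if (reset != null) { c.mr = reset; }
	return c;
}

function fromMiniCompact(c: MiniCompact): MiniEntry {
	return {
		timestamp: c.t,
		// Restore reset_time so getUsageLimitResetTime()-style lookups still work.
		message: c.mr != null ? { content: { reset_time: c.mr } } : {},
	};
}
```

Preserving a single short field keeps the cache compact while making the compact path safe for block loading.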
```typescript
// Pre-filter files by date range using cached metadata (no I/O needed).
// Files whose latest entry is before `since` or earliest is after `until` are skipped.
const datePreFilteredFiles: string[] = [];
for (const file of projectFilteredFiles) {
	if (await mightContainEntriesInRange(file, options?.since, options?.until)) {
		datePreFilteredFiles.push(file);
	}
}
```
File-level prefiltering makes global dedup date-scoped once metadata exists.
After the cache has et/lt, processedHashes is built only from datePreFilteredFiles. An older duplicate outside --since/--until no longer suppresses a newer in-range copy, so totals change at the window boundary. The same pattern is repeated on Lines 1052-1058 and Lines 1493-1499.
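A sketch of the ordering the review calls for: seed the dedup hash set from all known files first, then restrict actual I/O to the in-range files, so an out-of-window duplicate still suppresses an in-range copy. The shapes and names here are illustrative assumptions, not the loader's real data structures:

```typescript
type FileMeta = { path: string; cachedHashes: string[]; inRange: boolean };

// Build processedHashes from EVERY known file's cached hashes, then use the
// date pre-filter only to decide which files to actually read from disk.
function planProcessing(files: FileMeta[]): { processedHashes: Set<string>; filesToRead: string[] } {
	const processedHashes = new Set<string>();
	for (const f of files) {
		if (!f.inRange) {
			// Out-of-window files are never read, but their hashes still
			// participate in deduplication.
			for (const h of f.cachedHashes) { processedHashes.add(h); }
		}
	}
	const filesToRead = files.filter(f => f.inRange).map(f => f.path);
	return { processedHashes, filesToRead };
}
```

With this split, the date filter saves I/O without changing which entries count as duplicates.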
- PR ryoppippi#922 benchmark data (74s → 5s, 14x speedup)
- Real-world optimization results (50% token reduction)
- Fresh install / replace official / switch back instructions
- Link to companion claude-context-optimizer plugin

Co-Authored-By: Claude <noreply@anthropic.com>
Summary

- Per-file cache (`~/.cache/ccusage/data-cache-v1.json`) that stores parsed JSONL entries keyed by `mtime + size`, so unchanged session files are never re-read from disk
- In-memory stat cache to avoid redundant `stat()` syscalls within a single run
- `--since`/`--until` flags now skip files outside the date range using cached earliest/latest timestamps, dramatically reducing the number of files processed

Motivation

With heavy Claude Code usage (8700+ session files, 3.3GB), `ccusage` takes 74 seconds because it re-reads and re-parses every JSONL file on every invocation — even files that haven't changed since the last run.

Benchmarks

Tested with 8714 session files (177k entries, 3.3GB total):

| Scenario | Time | Speedup |
| --- | --- | --- |
| Original | 74s | — |
| Cached (no filter) | 18s | 4x |
| Cached + `--since` | 5s | 14x |

Implementation

New file: `_file-cache.ts`
- `mtime + size` validation
- `mightContainEntriesInRange()` for O(1) date pre-filtering per file

Modified: `data-loader.ts`
- `getEarliestTimestamp()` — cache hit returns instantly, cache miss populates full cache
- `loadDailyUsageData()` — uses cached entries + date pre-filter
- `loadSessionData()` — uses cached entries + date pre-filter
- `loadSessionBlockData()` — uses cached entries + date pre-filter
- `loadSessionUsageById()` — uses cached entries

Test plan

- All 354 existing tests pass unchanged
- Verified `--offline`, `--since`, `--until`, `--project` flags

Summary by CodeRabbit