Add benchmark gates for recurring sync and DB performance regressions by mariusvniekerk · Pull Request #985 · kenn-io/agentsview

mariusvniekerk · 2026-07-03T16:36:03Z

Why

agentsview has repeatedly shipped the same performance regression shape: sync work that should scale with new data silently starts scaling with archive size, and nothing in CI notices until a machine spends minutes in a sync pass. Recent history: discovery recomputing root-derived project info per source (#912), the provider migration dropping pre-parse freshness skips, O(session-history) work on every streamed append (#954), bulk-ingest throughput (#411), and per-row json_extract in the usage scan (#309). Each fix landed without a gate, so any refactor could quietly reintroduce its class.

What this adds

Layer 1 — deterministic work-count invariants in the normal test suite (immune to runner noise): TestWarmFullSyncDoesNoBulkWriteWork asserts a second full sync over an unchanged Claude archive skips every session and runs zero bulk-write batches, via the existing SyncStats/PhaseStats counters. It complements the existing Vibe freshness, signal-scheduler debounce, and parser once-per-root seam tests; docs/internal/performance-gates.md maps each historical regression class to the gate that covers it.

Layer 2 — a per-PR benchmark gate (.github/workflows/bench.yml): hot-path benchmarks (warm no-op full sync, incremental append into a 1k-message session, cold-archive ingest through both the default and the resync bulk-write pipelines, streaming chunk-merge replace, batched insert, plus the existing usage and secret-scan benchmarks) run on the PR head and its merge base in the same job on the same runner, each side via make bench-gate — the Makefile is the single source of truth for the gated package list, sample count, and iteration count — and are compared by cmd/benchgate.

benchgate builds on golang.org/x/perf (benchfmt parsing, benchmath — the statistics behind benchstat) and adds only the policy benchstat doesn't provide: thresholds, floors, and a failing exit code. The policy is tuned for shared-runner noise:

Gating is per benchmark; nothing is averaged across benchmarks.
allocs/op (1.25x) and B/op (1.35x) are deterministic for fixed code and iteration count, so they gate the candidate's worst -count run against the baseline median — a single outlier run is a real intermittent allocation path and fails. Failure lines include the baseline's worst run so pre-existing instability is visible.
Time gates medians at a loose 2.0x and requires Mann-Whitney significance, so one slow run can't flake a PR but algorithmic blowups fail. Fewer than 5 candidate samples is a loud configuration error; a baseline with fewer than 5 samples (a legitimately partial base run) is reported and not gated.
Benchmarks present on only one side never gate, which makes the gate allowlist-free: every benchmark in the gated packages runs, and new ones auto-join once they exist on both sides. Because each side runs its own commit's make bench-gate, a PR that grows BENCH_GATE_PACKAGES cannot break the base run.
A failing merge-base run degrades to whatever partial output it produced, plus a workflow warning — one broken package does not erase the baseline for the others.
Captures with unparseable result lines (for example log output interleaved into a Benchmark line) fail loudly with exit 2 instead of silently dropping those benchmarks from both sides; the sync benchmarks silence the engine's logger for exactly this reason. A gated unit the baseline has but the candidate lost (e.g. -benchmem dropped) is also a configuration error; a unit missing from the baseline, and custom b.ReportMetric units, are reported as not gated.
The gate runs a fixed -benchtime=20x so baseline and candidate measure identical workloads (two benchmarks deliberately grow their fixture per iteration); the count and benchtime are evaluated from the PR head's Makefile and passed into the merge-base run so changing them cannot skew the comparison. Benchmarks keep fixture construction and one-time leading-edge work out of the timed region so the gated ratios measure product cost, not test-helper cost.

Tradeoffs and limitations

Subtle wall-clock regressions (< 2x or statistically ambiguous) are deliberately not time-gated; the allocation metrics act as the sensitive proxy for O(archive)-vs-O(delta) mistakes.
On this PR itself, the merge base predates the bench-gate Make target, so the baseline step degrades to empty (with a visible warning) and every benchmark reports as new; the gate is fully live from the first PR after this merges.
The resync sanitize-pass class (fixed on the unmerged heavy-angora branch) is not gated yet; a gate would fail against current main until that lands.
golang.org/x/perf's module requirements force minor bumps of x/net, x/oauth2, x/crypto, x/exp, and x/text via MVS.
The workflow runs only on PRs that touch Go files, go.mod/go.sum, the Makefile, or the workflow itself; docs- and frontend-only PRs skip it. Cross-backend query benchmarks stay in the opt-in internal/backendbench (Docker) and are not part of the gate.

Where to look

cmd/benchgate/ — the comparison policy (worst-run gating, significance requirement, corrupted-capture and missing-unit reporting)
.github/workflows/bench.yml — base-vs-head orchestration and partial-baseline degradation
internal/sync/engine_bench_test.go, internal/db/messages_bench_test.go, internal/sync/perf_invariant_test.go
docs/internal/performance-gates.md — regression history and how to add a gated benchmark

🤖 Generated with Claude Code

^{generated by a clanker}

…essions agentsview has repeatedly shipped the same performance regression shape: sync work that should scale with new data silently starts scaling with archive size, and nothing in CI notices until a user's machine spends minutes in a sync pass (#912 discovery recomputing root project info per source, the provider-migration loss of pre-parse freshness skips, #954's O(history) work per streamed append, #309's per-row json_extract scan). The fixes landed, but each class had no gate, so the next refactor could reintroduce it. This adds a two-layer gate keyed to those historical classes: - Deterministic work-count invariants that run in the normal test suite and are immune to runner noise: a warm full sync over an unchanged Claude archive must skip every session and run zero bulk-write batches (asserted via SyncStats + PhaseStats). This complements the existing Vibe freshness test, which covers the generic providerSourceUnchangedInDB path but not Claude's own pre-parse skip. - Hot-path benchmarks (warm no-op full sync, incremental append into a 1k-message session, cold-archive ingest, streaming chunk-merge replace, batched message insert) plus the existing GetDailyUsage and secrets benchmarks, compared on every PR against the merge base by a new bench.yml workflow. cmd/benchgate gates hard on allocs/op and B/op — deterministic on a fixed runner, and an O(archive)-vs-O(delta) regression always inflates them — while ns/op gets a loose 2x limit so runner noise cannot flake a PR but algorithmic blowups still fail. Benchmarks present on only one side are reported, never gated, so the first PR (empty baseline) and future benchmark additions/removals cannot wedge the gate. Both sides run in one job on one runner to keep the comparison apples to apples; a broken merge base degrades to an empty baseline instead of blocking. The warm-noop and append benchmarks self-assert the invariant they protect (Synced==0, appends absorbed), so a refactor that silently changes what the benchmark measures fails loudly instead of producing a meaningless number. docs/internal/performance-gates.md records the regression history, which gate covers which class, and how to add a benchmark to the pattern (kept in sync between bench.yml and the Makefile bench-gate target). Cross-backend query benchmarks stay in internal/backendbench: they need Docker/testcontainers and answer a different question (backend parity), so they are not part of the per-PR gate.

…f hand-rolled parsing The first benchgate cut reimplemented two things the Go performance tooling already provides: a parser for the benchmark output format and a comparison statistic. Swap both for golang.org/x/perf — benchfmt for parsing (handles the full format spec: config lines, unit tidying, sub-benchmarks, stray non-benchmark output) and benchmath, the statistics engine behind the rewritten benchstat, for summaries and significance testing. benchgate keeps only the policy benchstat deliberately does not provide: thresholds, floors, and a failing exit code for CI. Two behavioral changes fall out, both improvements: - Repeated -count runs are summarized by their median (benchstat semantics) instead of the minimum, and the time gate now also requires the difference to be statistically significant (Mann-Whitney U, p < 0.05) before failing — so a single noisy run cannot fail a PR, while allocs/op and B/op still gate on ratio alone and catch deterministic regressions even from a single run. - Benchmark keys include the package path (pkg config line), so same-named benchmarks in different packages can never merge into one sample. CLI flags, the workflow, and the Makefile target are unchanged; -time-floor-ns is converted internally to benchfmt's tidied sec/op unit. The x/net, x/oauth2, x/crypto, x/exp, and x/text bumps are forced by x/perf's module requirements via MVS. Validation: full -short suite green repo-wide; benchgate verified end-to-end against real bench output for the identical, regressed, and no-baseline cases.

…en warm-noop assertion Review follow-ups on the bench gate (roborev 4813/4814): - A failing merge-base benchmark run could leave partial output from packages that finished before the failure, so the gate would compare against incomplete stale data instead of the documented empty baseline. The workflow now truncates the baseline file on failure. - The gate now runs with a fixed -benchtime=20x iteration count instead of a 500ms duration. Two gated benchmarks grow their fixture as they iterate (appending to a session, adding sessions to a DB), so a duration-based benchtime let a slower side run fewer iterations against a different final size — comparing different workloads. Fixed iterations keep both sides identical; the growth is now documented on both benchmarks (it is deliberate for the append benchmark: flat per-append cost as the session grows is the invariant it protects). - BenchmarkSyncAllWarmNoop now also asserts zero bulk writes via PhaseStats, matching what the docs claimed it self-asserted. - Docs: adding a gated benchmark now includes the BENCH_PACKAGES/Makefile package-list step (a name in the pattern whose package is not in the list silently never runs), and the local run instructions are a complete before/after sequence.

Gating was per-benchmark already — any single benchmark over its threshold fails the PR — but within one benchmark's repeated -count runs, every metric was summarized by its median, so a regression that only manifests in some runs (an intermittent allocation path, a data-dependent branch) could hide behind five clean runs. For allocs/op and B/op that leniency buys nothing: with identical code and a fixed iteration count they are deterministic, so an outlier run is signal, not noise. Those gates now compare the candidate's worst run against the baseline median — one bad run out of six fails. Wall-clock time keeps median-plus-significance, because a single slow run on a shared runner genuinely is noisy-neighbor noise and gating on it would make the workflow flaky.

…op the benchmark allowlist Review follow-ups (roborev 4826/4827/4829/4830) plus a maintenance simplification: - The sec/op gate uses Mann-Whitney U, which can never reach significance with fewer than 5 samples per side — so a small -count silently disabled the time gate while the docs claimed coverage. benchgate now treats that as an explicit configuration error, and the workflow/Makefile document the >= 5 requirement next to the count settings. - The candidate-worst vs baseline-median asymmetry for deterministic metrics is now documented as an intentional candidate-must-be-clean gate, and worst-case report lines include the baseline's worst run so pre-existing instability is visible when reading a failure. Both stale flag help texts (median wording) now describe the worst-run policy. - The local baseline recipe used git stash, which neither moves committed branch work back to a baseline nor removes untracked candidate files — the baseline run could measure candidate code. It now mirrors CI's worktree approach. - Docs note that report identifiers are package-qualified only when the capture carries pkg: metadata (bare-name fallback otherwise), and that qualified and unqualified captures must not be mixed. - Dropped BENCH_PATTERN/BENCH_GATE_PATTERN entirely: the per-name allowlist duplicated information and could silently rot. The gate now runs every benchmark in the gated packages; this is safe because a benchmark with no baseline is reported without gating, so new benchmarks auto-join the gate only once they exist on both sides. BENCH_PACKAGES is the single gate boundary.

roborev-ci · 2026-07-03T16:47:15Z

roborev: Combined Review (`6e01b26`)

No Medium, High, or Critical findings were reported.

Reviewers found no actionable findings at or above Medium severity.

Reviewers: 2 done | Synthesis: codex, 10s | Total: 10m17s

wesm · 2026-07-03T20:30:36Z

looking at this

…ingle source Review fixes for the bench-gate PR: - Silence the sync engine's global logger inside the sync benchmarks: interleaved log.Printf output corrupted the 'BenchmarkX' result lines, so benchfmt parsed zero sync results and all three sync benchmarks silently dropped out of the gate on both sides. - benchgate now treats unparseable result lines as a corrupted capture (report + exit 2) instead of skipping benchfmt syntax errors, and reports a gated unit missing from one side ('missing from baseline/candidate, not gated') instead of a bare continue. Custom ReportMetric units are reported as ungated. - Split the sample-count policy: too few candidate samples stays a config error, but a short baseline (a legitimately partial base run) is reported and not gated. - Print detected regressions even when a config issue forces exit 2, so a real violation is never hidden behind a configuration error. - bench.yml now runs 'make bench-gate' on both sides; the Makefile's BENCH_GATE_PACKAGES/COUNT/TIME are the single source of truth, and each side benchmarks its own commit's package list, so growing the gate cannot empty the baseline. A failing base run degrades to its partial output plus a workflow warning instead of truncating the whole baseline into a vacuous pass. Run steps use set -euo pipefail and the workflow gains a Go-paths filter. - Absorb the leading-edge O(history) signal recompute before the timer in BenchmarkSyncPathsIncrementalAppend (it was ~58% of the measured allocs, masking steady-state regressions), stretch the debounce window so the flush timer cannot fire nondeterministically inside a worst-run-gated loop, and pre-build appended lines. - Build BenchmarkInsertMessagesBatch's 200-message fixture once outside the timed loop; only the SessionID is rewritten per iteration. - Widen testDB to testing.TB, drop the bench-only openBenchDB copy, and fix BenchmarkGetDailyUsage's detached testing.T hack (its require failures and cleanups were wired to nothing). - Split benchgate's compare()/main() into evalGate/compareUnits/ render/parseFlags to meet function-size and complexity limits, and drop the dead NaN guard. - Document the new mechanics in docs/internal/performance-gates.md and reformat it with mdformat --wrap 80.

wesm · 2026-07-03T21:25:21Z

I reviewed this branch and pushed 3109178 with fixes. Summary for reviewers, most severe first:

Bugs fixed

The three sync benchmarks were silently absent from the gate. The engine's log.Printf output interleaves into go test -bench result lines on stdout, so benchfmt could not parse BenchmarkSyncAllWarmNoop, BenchmarkSyncPathsIncrementalAppend, or BenchmarkSyncAllColdArchive. Because the corruption was identical on both sides, they did not even appear as new/removed — the gate's headline benchmarks measured nothing. The benchmarks now silence the global logger (silenceBenchLogs), and benchgate treats unparseable result lines as a corrupted capture: reported with positions, exit 2. Without that second half, the same failure mode could return invisibly.
Any base-side failure emptied the whole baseline, turning the gate into a vacuous pass. The worst trigger was the documented procedure for growing the gate: adding a package to BENCH_PACKAGES made the merge-base run exit 1 ("matched no packages"), truncating the baseline for every package on that very PR. The workflow now runs make bench-gate on both sides — the Makefile is the single source for the package list, sample count, and iteration count — and each side benchmarks its own commit's list, so growing the gate cannot break the base run. A failing base run keeps whatever partial output it produced (plus a ::warning) instead of truncating. To support partial baselines, the sample-count policy is now asymmetric: too few candidate samples is still a config error, but a short baseline is reported and not gated.
Silent gate gaps became report lines. A gated unit missing from one side (e.g. a baseline captured without -benchmem) now reports "missing from baseline, not gated" instead of a bare continue; custom b.ReportMetric units report "has no gate, not gated"; and detected regressions print even when a config issue forces exit 2, so a real violation is never hidden behind a configuration error.
Measurement quality of the append benchmark. The first append after a full sync triggers the leading-edge O(history) signal recompute inside the timed loop — roughly 58% of the measured allocs/op was that one-time cost, so a 1.25x steady-state per-append regression only moved the gated metric ~1.10x and passed. The benchmark now absorbs it before ResetTimer, stretches the debounce window so the 10s flush timer cannot fire nondeterministically inside a worst-run-gated loop, and pre-builds the appended lines. BenchmarkInsertMessagesBatch similarly builds its 200-message fixture once outside the timer (only the SessionID is rewritten per iteration, which allocates nothing).

Cleanups: testDB widened to testing.TB, which removes the bench-only openBenchDB copy and the pre-existing testDB(&testing.T{}) hack in BenchmarkGetDailyUsage (its require failures and cleanups were wired to a detached T that this workflow would have run 12x per PR); benchgate's compare()/main() split into evalGate/compareUnits/render/parseFlags; set -euo pipefail in both run steps; a Go-paths filter so docs/frontend-only PRs skip the job; docs updated to match.

Implications

On this PR itself, the merge base predates the bench-gate Make target, so the baseline step degrades (with a visible warning) and every benchmark reports as "new, no baseline". The gate is fully live from the first PR after this merges.
benchgate exiting 2 on corrupted captures means a future benchmark whose code logs to stdout fails its own PR loudly rather than silently un-gating itself. The docs now tell benchmark authors to silence loggers.

Verified end-to-end: the full gated suite parses all 8 benchmarks, a self-compare exits 0, an injected corrupted line exits 2, and the benchgate/db/sync suites pass.

roborev-ci · 2026-07-03T21:30:51Z

roborev: Combined Review (`3109178`)

Summary verdict: Medium issues remain; no Critical or High findings were reported.

Medium

cmd/benchgate/main.go:311
A gated metric present in the baseline but missing from the candidate is only reported as “not gated” and does not create a config issue, so render can still exit 0. If -benchmem is dropped or the candidate capture loses B/op / allocs/op, CI can silently disable deterministic gates despite the documented exit-2 policy.
Fix: Treat missing candidate gated units as configuration issues so the gate exits 2, while keeping intentionally partial-baseline behavior explicit.
internal/sync/engine_bench_test.go:296
BenchmarkSyncAllColdArchive claims to measure bulk-write ingest but calls SyncAll, which uses syncWriteDefault; the actual writeBatchBulk / DB.WriteSessionBatch path used by resync is not exercised, leaving the documented bulk-ingest regression class ungated.
Fix: Add or change a benchmark to drive syncAllLocked(..., syncWriteBulk, ...), resync, or DB.WriteSessionBatch directly.

Reviewers: 2 done | Synthesis: codex, 9s | Total: 9m25s

…he bulk-write path Addresses the roborev findings on 3109178: - A gated unit present in the baseline but missing from the candidate (e.g. -benchmem dropped) is now a configuration error (exit 2) instead of a report-only line, so a PR cannot silently disable the deterministic gates for itself and every PR after it. A unit missing from the baseline stays reported-and-not-gated, since a partial or older base run is legitimate. - bench.yml evaluates 'make bench-gate-config' on the PR head and passes BENCH_GATE_COUNT/TIME into the merge-base invocation, so a PR that changes those defaults still compares identical workloads (two benchmarks grow their fixture per iteration). The package list intentionally stays per-side. - Add BenchmarkResyncBulkIngest: SyncAll uses the default per-session write path, so the cold-archive benchmark never exercised writeBatchBulk/DB.WriteSessionBatch — the #411 bulk ingest class was ungated. The new benchmark drives syncAllLocked with syncWriteBulk and self-asserts every session went through the batch pipeline; the cold-archive benchmark is documented as covering the default first-sync write path. Both share one benchColdArchive loop. - Update docs/internal/performance-gates.md for the unit-gap policy, the pinned gate config, and the new benchmark.

roborev-ci · 2026-07-04T03:42:53Z

roborev: Combined Review (`6342b2b`)

No issues found.

Reviewers: 2 done | Synthesis: codex | Total: 7m41s

mariusvniekerk added 5 commits July 3, 2026 10:28

wesm merged commit eaa34a2 into main Jul 4, 2026
24 checks passed

wesm deleted the t3code/performance-benchmark-gates branch July 4, 2026 10:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add benchmark gates for recurring sync and DB performance regressions#985

Add benchmark gates for recurring sync and DB performance regressions#985
wesm merged 7 commits into
mainfrom
t3code/performance-benchmark-gates

mariusvniekerk commented Jul 3, 2026 •

edited by wesm

Loading

Uh oh!

roborev-ci Bot commented Jul 3, 2026

Uh oh!

wesm commented Jul 3, 2026

Uh oh!

wesm commented Jul 3, 2026

Uh oh!

roborev-ci Bot commented Jul 3, 2026

Uh oh!

roborev-ci Bot commented Jul 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mariusvniekerk commented Jul 3, 2026 • edited by wesm Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What this adds

Tradeoffs and limitations

Where to look

Uh oh!

roborev-ci Bot commented Jul 3, 2026

roborev: Combined Review (6e01b26)

Uh oh!

wesm commented Jul 3, 2026

Uh oh!

wesm commented Jul 3, 2026

Uh oh!

roborev-ci Bot commented Jul 3, 2026

roborev: Combined Review (3109178)

Medium

Uh oh!

roborev-ci Bot commented Jul 4, 2026

roborev: Combined Review (6342b2b)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

mariusvniekerk commented Jul 3, 2026 •

edited by wesm

Loading

roborev: Combined Review (`6e01b26`)

roborev: Combined Review (`3109178`)

roborev: Combined Review (`6342b2b`)