Add benchmark gates for recurring sync and DB performance regressions#985
Conversation
…essions agentsview has repeatedly shipped the same performance regression shape: sync work that should scale with new data silently starts scaling with archive size, and nothing in CI notices until a user's machine spends minutes in a sync pass (#912 discovery recomputing root project info per source, the provider-migration loss of pre-parse freshness skips, #954's O(history) work per streamed append, #309's per-row json_extract scan). The fixes landed, but each class had no gate, so the next refactor could reintroduce it. This adds a two-layer gate keyed to those historical classes: - Deterministic work-count invariants that run in the normal test suite and are immune to runner noise: a warm full sync over an unchanged Claude archive must skip every session and run zero bulk-write batches (asserted via SyncStats + PhaseStats). This complements the existing Vibe freshness test, which covers the generic providerSourceUnchangedInDB path but not Claude's own pre-parse skip. - Hot-path benchmarks (warm no-op full sync, incremental append into a 1k-message session, cold-archive ingest, streaming chunk-merge replace, batched message insert) plus the existing GetDailyUsage and secrets benchmarks, compared on every PR against the merge base by a new bench.yml workflow. cmd/benchgate gates hard on allocs/op and B/op — deterministic on a fixed runner, and an O(archive)-vs-O(delta) regression always inflates them — while ns/op gets a loose 2x limit so runner noise cannot flake a PR but algorithmic blowups still fail. Benchmarks present on only one side are reported, never gated, so the first PR (empty baseline) and future benchmark additions/removals cannot wedge the gate. Both sides run in one job on one runner to keep the comparison apples to apples; a broken merge base degrades to an empty baseline instead of blocking. The warm-noop and append benchmarks self-assert the invariant they protect (Synced==0, appends absorbed), so a refactor that silently changes what the benchmark measures fails loudly instead of producing a meaningless number. docs/internal/performance-gates.md records the regression history, which gate covers which class, and how to add a benchmark to the pattern (kept in sync between bench.yml and the Makefile bench-gate target). Cross-backend query benchmarks stay in internal/backendbench: they need Docker/testcontainers and answer a different question (backend parity), so they are not part of the per-PR gate.
…f hand-rolled parsing The first benchgate cut reimplemented two things the Go performance tooling already provides: a parser for the benchmark output format and a comparison statistic. Swap both for golang.org/x/perf — benchfmt for parsing (handles the full format spec: config lines, unit tidying, sub-benchmarks, stray non-benchmark output) and benchmath, the statistics engine behind the rewritten benchstat, for summaries and significance testing. benchgate keeps only the policy benchstat deliberately does not provide: thresholds, floors, and a failing exit code for CI. Two behavioral changes fall out, both improvements: - Repeated -count runs are summarized by their median (benchstat semantics) instead of the minimum, and the time gate now also requires the difference to be statistically significant (Mann-Whitney U, p < 0.05) before failing — so a single noisy run cannot fail a PR, while allocs/op and B/op still gate on ratio alone and catch deterministic regressions even from a single run. - Benchmark keys include the package path (pkg config line), so same-named benchmarks in different packages can never merge into one sample. CLI flags, the workflow, and the Makefile target are unchanged; -time-floor-ns is converted internally to benchfmt's tidied sec/op unit. The x/net, x/oauth2, x/crypto, x/exp, and x/text bumps are forced by x/perf's module requirements via MVS. Validation: full -short suite green repo-wide; benchgate verified end-to-end against real bench output for the identical, regressed, and no-baseline cases.
…en warm-noop assertion Review follow-ups on the bench gate (roborev 4813/4814): - A failing merge-base benchmark run could leave partial output from packages that finished before the failure, so the gate would compare against incomplete stale data instead of the documented empty baseline. The workflow now truncates the baseline file on failure. - The gate now runs with a fixed -benchtime=20x iteration count instead of a 500ms duration. Two gated benchmarks grow their fixture as they iterate (appending to a session, adding sessions to a DB), so a duration-based benchtime let a slower side run fewer iterations against a different final size — comparing different workloads. Fixed iterations keep both sides identical; the growth is now documented on both benchmarks (it is deliberate for the append benchmark: flat per-append cost as the session grows is the invariant it protects). - BenchmarkSyncAllWarmNoop now also asserts zero bulk writes via PhaseStats, matching what the docs claimed it self-asserted. - Docs: adding a gated benchmark now includes the BENCH_PACKAGES/Makefile package-list step (a name in the pattern whose package is not in the list silently never runs), and the local run instructions are a complete before/after sequence.
Gating was per-benchmark already — any single benchmark over its threshold fails the PR — but within one benchmark's repeated -count runs, every metric was summarized by its median, so a regression that only manifests in some runs (an intermittent allocation path, a data-dependent branch) could hide behind five clean runs. For allocs/op and B/op that leniency buys nothing: with identical code and a fixed iteration count they are deterministic, so an outlier run is signal, not noise. Those gates now compare the candidate's worst run against the baseline median — one bad run out of six fails. Wall-clock time keeps median-plus-significance, because a single slow run on a shared runner genuinely is noisy-neighbor noise and gating on it would make the workflow flaky.
…op the benchmark allowlist Review follow-ups (roborev 4826/4827/4829/4830) plus a maintenance simplification: - The sec/op gate uses Mann-Whitney U, which can never reach significance with fewer than 5 samples per side — so a small -count silently disabled the time gate while the docs claimed coverage. benchgate now treats that as an explicit configuration error, and the workflow/Makefile document the >= 5 requirement next to the count settings. - The candidate-worst vs baseline-median asymmetry for deterministic metrics is now documented as an intentional candidate-must-be-clean gate, and worst-case report lines include the baseline's worst run so pre-existing instability is visible when reading a failure. Both stale flag help texts (median wording) now describe the worst-run policy. - The local baseline recipe used git stash, which neither moves committed branch work back to a baseline nor removes untracked candidate files — the baseline run could measure candidate code. It now mirrors CI's worktree approach. - Docs note that report identifiers are package-qualified only when the capture carries pkg: metadata (bare-name fallback otherwise), and that qualified and unqualified captures must not be mixed. - Dropped BENCH_PATTERN/BENCH_GATE_PATTERN entirely: the per-name allowlist duplicated information and could silently rot. The gate now runs every benchmark in the gated packages; this is safe because a benchmark with no baseline is reported without gating, so new benchmarks auto-join the gate only once they exist on both sides. BENCH_PACKAGES is the single gate boundary.
roborev: Combined Review (
|
|
looking at this |
…ingle source
Review fixes for the bench-gate PR:
- Silence the sync engine's global logger inside the sync benchmarks:
interleaved log.Printf output corrupted the 'BenchmarkX' result
lines, so benchfmt parsed zero sync results and all three sync
benchmarks silently dropped out of the gate on both sides.
- benchgate now treats unparseable result lines as a corrupted
capture (report + exit 2) instead of skipping benchfmt syntax
errors, and reports a gated unit missing from one side ('missing
from baseline/candidate, not gated') instead of a bare continue.
Custom ReportMetric units are reported as ungated.
- Split the sample-count policy: too few candidate samples stays a
config error, but a short baseline (a legitimately partial base
run) is reported and not gated.
- Print detected regressions even when a config issue forces exit 2,
so a real violation is never hidden behind a configuration error.
- bench.yml now runs 'make bench-gate' on both sides; the Makefile's
BENCH_GATE_PACKAGES/COUNT/TIME are the single source of truth, and
each side benchmarks its own commit's package list, so growing the
gate cannot empty the baseline. A failing base run degrades to its
partial output plus a workflow warning instead of truncating the
whole baseline into a vacuous pass. Run steps use set -euo
pipefail and the workflow gains a Go-paths filter.
- Absorb the leading-edge O(history) signal recompute before the
timer in BenchmarkSyncPathsIncrementalAppend (it was ~58% of the
measured allocs, masking steady-state regressions), stretch the
debounce window so the flush timer cannot fire nondeterministically
inside a worst-run-gated loop, and pre-build appended lines.
- Build BenchmarkInsertMessagesBatch's 200-message fixture once
outside the timed loop; only the SessionID is rewritten per
iteration.
- Widen testDB to testing.TB, drop the bench-only openBenchDB copy,
and fix BenchmarkGetDailyUsage's detached testing.T hack (its
require failures and cleanups were wired to nothing).
- Split benchgate's compare()/main() into evalGate/compareUnits/
render/parseFlags to meet function-size and complexity limits, and
drop the dead NaN guard.
- Document the new mechanics in docs/internal/performance-gates.md
and reformat it with mdformat --wrap 80.
|
I reviewed this branch and pushed 3109178 with fixes. Summary for reviewers, most severe first: Bugs fixed
Cleanups: Implications
Verified end-to-end: the full gated suite parses all 8 benchmarks, a self-compare exits 0, an injected corrupted line exits 2, and the benchgate/db/sync suites pass. |
roborev: Combined Review (
|
…he bulk-write path Addresses the roborev findings on 3109178: - A gated unit present in the baseline but missing from the candidate (e.g. -benchmem dropped) is now a configuration error (exit 2) instead of a report-only line, so a PR cannot silently disable the deterministic gates for itself and every PR after it. A unit missing from the baseline stays reported-and-not-gated, since a partial or older base run is legitimate. - bench.yml evaluates 'make bench-gate-config' on the PR head and passes BENCH_GATE_COUNT/TIME into the merge-base invocation, so a PR that changes those defaults still compares identical workloads (two benchmarks grow their fixture per iteration). The package list intentionally stays per-side. - Add BenchmarkResyncBulkIngest: SyncAll uses the default per-session write path, so the cold-archive benchmark never exercised writeBatchBulk/DB.WriteSessionBatch — the #411 bulk ingest class was ungated. The new benchmark drives syncAllLocked with syncWriteBulk and self-asserts every session went through the batch pipeline; the cold-archive benchmark is documented as covering the default first-sync write path. Both share one benchColdArchive loop. - Update docs/internal/performance-gates.md for the unit-gap policy, the pinned gate config, and the new benchmark.
roborev: Combined Review (
|
Why
agentsview has repeatedly shipped the same performance regression shape: sync work that should scale with new data silently starts scaling with archive size, and nothing in CI notices until a machine spends minutes in a sync pass. Recent history: discovery recomputing root-derived project info per source (#912), the provider migration dropping pre-parse freshness skips, O(session-history) work on every streamed append (#954), bulk-ingest throughput (#411), and per-row
json_extractin the usage scan (#309). Each fix landed without a gate, so any refactor could quietly reintroduce its class.What this adds
Layer 1 — deterministic work-count invariants in the normal test suite (immune to runner noise):
TestWarmFullSyncDoesNoBulkWriteWorkasserts a second full sync over an unchanged Claude archive skips every session and runs zero bulk-write batches, via the existingSyncStats/PhaseStatscounters. It complements the existing Vibe freshness, signal-scheduler debounce, and parser once-per-root seam tests;docs/internal/performance-gates.mdmaps each historical regression class to the gate that covers it.Layer 2 — a per-PR benchmark gate (
.github/workflows/bench.yml): hot-path benchmarks (warm no-op full sync, incremental append into a 1k-message session, cold-archive ingest through both the default and the resync bulk-write pipelines, streaming chunk-merge replace, batched insert, plus the existing usage and secret-scan benchmarks) run on the PR head and its merge base in the same job on the same runner, each side viamake bench-gate— the Makefile is the single source of truth for the gated package list, sample count, and iteration count — and are compared bycmd/benchgate.benchgatebuilds ongolang.org/x/perf(benchfmtparsing,benchmath— the statistics behind benchstat) and adds only the policy benchstat doesn't provide: thresholds, floors, and a failing exit code. The policy is tuned for shared-runner noise:allocs/op(1.25x) andB/op(1.35x) are deterministic for fixed code and iteration count, so they gate the candidate's worst-countrun against the baseline median — a single outlier run is a real intermittent allocation path and fails. Failure lines include the baseline's worst run so pre-existing instability is visible.make bench-gate, a PR that growsBENCH_GATE_PACKAGEScannot break the base run.Benchmarkline) fail loudly with exit 2 instead of silently dropping those benchmarks from both sides; the sync benchmarks silence the engine's logger for exactly this reason. A gated unit the baseline has but the candidate lost (e.g.-benchmemdropped) is also a configuration error; a unit missing from the baseline, and customb.ReportMetricunits, are reported as not gated.-benchtime=20xso baseline and candidate measure identical workloads (two benchmarks deliberately grow their fixture per iteration); the count and benchtime are evaluated from the PR head's Makefile and passed into the merge-base run so changing them cannot skew the comparison. Benchmarks keep fixture construction and one-time leading-edge work out of the timed region so the gated ratios measure product cost, not test-helper cost.Tradeoffs and limitations
bench-gateMake target, so the baseline step degrades to empty (with a visible warning) and every benchmark reports as new; the gate is fully live from the first PR after this merges.heavy-angorabranch) is not gated yet; a gate would fail against current main until that lands.golang.org/x/perf's module requirements force minor bumps ofx/net,x/oauth2,x/crypto,x/exp, andx/textvia MVS.go.mod/go.sum, the Makefile, or the workflow itself; docs- and frontend-only PRs skip it. Cross-backend query benchmarks stay in the opt-ininternal/backendbench(Docker) and are not part of the gate.Where to look
cmd/benchgate/— the comparison policy (worst-run gating, significance requirement, corrupted-capture and missing-unit reporting).github/workflows/bench.yml— base-vs-head orchestration and partial-baseline degradationinternal/sync/engine_bench_test.go,internal/db/messages_bench_test.go,internal/sync/perf_invariant_test.godocs/internal/performance-gates.md— regression history and how to add a gated benchmark🤖 Generated with Claude Code
generated by a clanker