Skip to content

Add benchmark gates for recurring sync and DB performance regressions#985

Merged
wesm merged 7 commits into
mainfrom
t3code/performance-benchmark-gates
Jul 4, 2026
Merged

Add benchmark gates for recurring sync and DB performance regressions#985
wesm merged 7 commits into
mainfrom
t3code/performance-benchmark-gates

Conversation

@mariusvniekerk

@mariusvniekerk mariusvniekerk commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Why

agentsview has repeatedly shipped the same performance regression shape: sync work that should scale with new data silently starts scaling with archive size, and nothing in CI notices until a machine spends minutes in a sync pass. Recent history: discovery recomputing root-derived project info per source (#912), the provider migration dropping pre-parse freshness skips, O(session-history) work on every streamed append (#954), bulk-ingest throughput (#411), and per-row json_extract in the usage scan (#309). Each fix landed without a gate, so any refactor could quietly reintroduce its class.

What this adds

Layer 1 — deterministic work-count invariants in the normal test suite (immune to runner noise): TestWarmFullSyncDoesNoBulkWriteWork asserts a second full sync over an unchanged Claude archive skips every session and runs zero bulk-write batches, via the existing SyncStats/PhaseStats counters. It complements the existing Vibe freshness, signal-scheduler debounce, and parser once-per-root seam tests; docs/internal/performance-gates.md maps each historical regression class to the gate that covers it.

Layer 2 — a per-PR benchmark gate (.github/workflows/bench.yml): hot-path benchmarks (warm no-op full sync, incremental append into a 1k-message session, cold-archive ingest through both the default and the resync bulk-write pipelines, streaming chunk-merge replace, batched insert, plus the existing usage and secret-scan benchmarks) run on the PR head and its merge base in the same job on the same runner, each side via make bench-gate — the Makefile is the single source of truth for the gated package list, sample count, and iteration count — and are compared by cmd/benchgate.

benchgate builds on golang.org/x/perf (benchfmt parsing, benchmath — the statistics behind benchstat) and adds only the policy benchstat doesn't provide: thresholds, floors, and a failing exit code. The policy is tuned for shared-runner noise:

  • Gating is per benchmark; nothing is averaged across benchmarks.
  • allocs/op (1.25x) and B/op (1.35x) are deterministic for fixed code and iteration count, so they gate the candidate's worst -count run against the baseline median — a single outlier run is a real intermittent allocation path and fails. Failure lines include the baseline's worst run so pre-existing instability is visible.
  • Time gates medians at a loose 2.0x and requires Mann-Whitney significance, so one slow run can't flake a PR but algorithmic blowups fail. Fewer than 5 candidate samples is a loud configuration error; a baseline with fewer than 5 samples (a legitimately partial base run) is reported and not gated.
  • Benchmarks present on only one side never gate, which makes the gate allowlist-free: every benchmark in the gated packages runs, and new ones auto-join once they exist on both sides. Because each side runs its own commit's make bench-gate, a PR that grows BENCH_GATE_PACKAGES cannot break the base run.
  • A failing merge-base run degrades to whatever partial output it produced, plus a workflow warning — one broken package does not erase the baseline for the others.
  • Captures with unparseable result lines (for example log output interleaved into a Benchmark line) fail loudly with exit 2 instead of silently dropping those benchmarks from both sides; the sync benchmarks silence the engine's logger for exactly this reason. A gated unit the baseline has but the candidate lost (e.g. -benchmem dropped) is also a configuration error; a unit missing from the baseline, and custom b.ReportMetric units, are reported as not gated.
  • The gate runs a fixed -benchtime=20x so baseline and candidate measure identical workloads (two benchmarks deliberately grow their fixture per iteration); the count and benchtime are evaluated from the PR head's Makefile and passed into the merge-base run so changing them cannot skew the comparison. Benchmarks keep fixture construction and one-time leading-edge work out of the timed region so the gated ratios measure product cost, not test-helper cost.

Tradeoffs and limitations

  • Subtle wall-clock regressions (< 2x or statistically ambiguous) are deliberately not time-gated; the allocation metrics act as the sensitive proxy for O(archive)-vs-O(delta) mistakes.
  • On this PR itself, the merge base predates the bench-gate Make target, so the baseline step degrades to empty (with a visible warning) and every benchmark reports as new; the gate is fully live from the first PR after this merges.
  • The resync sanitize-pass class (fixed on the unmerged heavy-angora branch) is not gated yet; a gate would fail against current main until that lands.
  • golang.org/x/perf's module requirements force minor bumps of x/net, x/oauth2, x/crypto, x/exp, and x/text via MVS.
  • The workflow runs only on PRs that touch Go files, go.mod/go.sum, the Makefile, or the workflow itself; docs- and frontend-only PRs skip it. Cross-backend query benchmarks stay in the opt-in internal/backendbench (Docker) and are not part of the gate.

Where to look

  • cmd/benchgate/ — the comparison policy (worst-run gating, significance requirement, corrupted-capture and missing-unit reporting)
  • .github/workflows/bench.yml — base-vs-head orchestration and partial-baseline degradation
  • internal/sync/engine_bench_test.go, internal/db/messages_bench_test.go, internal/sync/perf_invariant_test.go
  • docs/internal/performance-gates.md — regression history and how to add a gated benchmark

🤖 Generated with Claude Code

generated by a clanker

…essions

agentsview has repeatedly shipped the same performance regression shape:
sync work that should scale with new data silently starts scaling with
archive size, and nothing in CI notices until a user's machine spends
minutes in a sync pass (#912 discovery recomputing root project info per
source, the provider-migration loss of pre-parse freshness skips, #954's
O(history) work per streamed append, #309's per-row json_extract scan).
The fixes landed, but each class had no gate, so the next refactor could
reintroduce it.

This adds a two-layer gate keyed to those historical classes:

- Deterministic work-count invariants that run in the normal test suite
  and are immune to runner noise: a warm full sync over an unchanged
  Claude archive must skip every session and run zero bulk-write batches
  (asserted via SyncStats + PhaseStats). This complements the existing
  Vibe freshness test, which covers the generic
  providerSourceUnchangedInDB path but not Claude's own pre-parse skip.

- Hot-path benchmarks (warm no-op full sync, incremental append into a
  1k-message session, cold-archive ingest, streaming chunk-merge
  replace, batched message insert) plus the existing GetDailyUsage and
  secrets benchmarks, compared on every PR against the merge base by a
  new bench.yml workflow. cmd/benchgate gates hard on allocs/op and B/op
  — deterministic on a fixed runner, and an O(archive)-vs-O(delta)
  regression always inflates them — while ns/op gets a loose 2x limit so
  runner noise cannot flake a PR but algorithmic blowups still fail.
  Benchmarks present on only one side are reported, never gated, so the
  first PR (empty baseline) and future benchmark additions/removals
  cannot wedge the gate. Both sides run in one job on one runner to keep
  the comparison apples to apples; a broken merge base degrades to an
  empty baseline instead of blocking.

The warm-noop and append benchmarks self-assert the invariant they
protect (Synced==0, appends absorbed), so a refactor that silently
changes what the benchmark measures fails loudly instead of producing a
meaningless number.

docs/internal/performance-gates.md records the regression history, which
gate covers which class, and how to add a benchmark to the pattern (kept
in sync between bench.yml and the Makefile bench-gate target).

Cross-backend query benchmarks stay in internal/backendbench: they need
Docker/testcontainers and answer a different question (backend parity),
so they are not part of the per-PR gate.
…f hand-rolled parsing

The first benchgate cut reimplemented two things the Go performance
tooling already provides: a parser for the benchmark output format and
a comparison statistic. Swap both for golang.org/x/perf — benchfmt for
parsing (handles the full format spec: config lines, unit tidying,
sub-benchmarks, stray non-benchmark output) and benchmath, the
statistics engine behind the rewritten benchstat, for summaries and
significance testing. benchgate keeps only the policy benchstat
deliberately does not provide: thresholds, floors, and a failing exit
code for CI.

Two behavioral changes fall out, both improvements:

- Repeated -count runs are summarized by their median (benchstat
  semantics) instead of the minimum, and the time gate now also
  requires the difference to be statistically significant
  (Mann-Whitney U, p < 0.05) before failing — so a single noisy run
  cannot fail a PR, while allocs/op and B/op still gate on ratio alone
  and catch deterministic regressions even from a single run.
- Benchmark keys include the package path (pkg config line), so
  same-named benchmarks in different packages can never merge into one
  sample.

CLI flags, the workflow, and the Makefile target are unchanged;
-time-floor-ns is converted internally to benchfmt's tidied sec/op
unit. The x/net, x/oauth2, x/crypto, x/exp, and x/text bumps are
forced by x/perf's module requirements via MVS.

Validation: full -short suite green repo-wide; benchgate verified
end-to-end against real bench output for the identical, regressed, and
no-baseline cases.
…en warm-noop assertion

Review follow-ups on the bench gate (roborev 4813/4814):

- A failing merge-base benchmark run could leave partial output from
  packages that finished before the failure, so the gate would compare
  against incomplete stale data instead of the documented empty
  baseline. The workflow now truncates the baseline file on failure.

- The gate now runs with a fixed -benchtime=20x iteration count
  instead of a 500ms duration. Two gated benchmarks grow their fixture
  as they iterate (appending to a session, adding sessions to a DB),
  so a duration-based benchtime let a slower side run fewer iterations
  against a different final size — comparing different workloads.
  Fixed iterations keep both sides identical; the growth is now
  documented on both benchmarks (it is deliberate for the append
  benchmark: flat per-append cost as the session grows is the
  invariant it protects).

- BenchmarkSyncAllWarmNoop now also asserts zero bulk writes via
  PhaseStats, matching what the docs claimed it self-asserted.

- Docs: adding a gated benchmark now includes the
  BENCH_PACKAGES/Makefile package-list step (a name in the pattern
  whose package is not in the list silently never runs), and the local
  run instructions are a complete before/after sequence.
Gating was per-benchmark already — any single benchmark over its
threshold fails the PR — but within one benchmark's repeated -count
runs, every metric was summarized by its median, so a regression that
only manifests in some runs (an intermittent allocation path, a
data-dependent branch) could hide behind five clean runs.

For allocs/op and B/op that leniency buys nothing: with identical code
and a fixed iteration count they are deterministic, so an outlier run
is signal, not noise. Those gates now compare the candidate's worst
run against the baseline median — one bad run out of six fails.
Wall-clock time keeps median-plus-significance, because a single slow
run on a shared runner genuinely is noisy-neighbor noise and gating on
it would make the workflow flaky.
…op the benchmark allowlist

Review follow-ups (roborev 4826/4827/4829/4830) plus a maintenance
simplification:

- The sec/op gate uses Mann-Whitney U, which can never reach
  significance with fewer than 5 samples per side — so a small -count
  silently disabled the time gate while the docs claimed coverage.
  benchgate now treats that as an explicit configuration error, and
  the workflow/Makefile document the >= 5 requirement next to the
  count settings.

- The candidate-worst vs baseline-median asymmetry for deterministic
  metrics is now documented as an intentional candidate-must-be-clean
  gate, and worst-case report lines include the baseline's worst run
  so pre-existing instability is visible when reading a failure. Both
  stale flag help texts (median wording) now describe the worst-run
  policy.

- The local baseline recipe used git stash, which neither moves
  committed branch work back to a baseline nor removes untracked
  candidate files — the baseline run could measure candidate code. It
  now mirrors CI's worktree approach.

- Docs note that report identifiers are package-qualified only when
  the capture carries pkg: metadata (bare-name fallback otherwise),
  and that qualified and unqualified captures must not be mixed.

- Dropped BENCH_PATTERN/BENCH_GATE_PATTERN entirely: the per-name
  allowlist duplicated information and could silently rot. The gate
  now runs every benchmark in the gated packages; this is safe
  because a benchmark with no baseline is reported without gating, so
  new benchmarks auto-join the gate only once they exist on both
  sides. BENCH_PACKAGES is the single gate boundary.
@roborev-ci

roborev-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown

roborev: Combined Review (6e01b26)

No Medium, High, or Critical findings were reported.

Reviewers found no actionable findings at or above Medium severity.


Reviewers: 2 done | Synthesis: codex, 10s | Total: 10m17s

@wesm

wesm commented Jul 3, 2026

Copy link
Copy Markdown
Member

looking at this

…ingle source

Review fixes for the bench-gate PR:

- Silence the sync engine's global logger inside the sync benchmarks:
  interleaved log.Printf output corrupted the 'BenchmarkX' result
  lines, so benchfmt parsed zero sync results and all three sync
  benchmarks silently dropped out of the gate on both sides.
- benchgate now treats unparseable result lines as a corrupted
  capture (report + exit 2) instead of skipping benchfmt syntax
  errors, and reports a gated unit missing from one side ('missing
  from baseline/candidate, not gated') instead of a bare continue.
  Custom ReportMetric units are reported as ungated.
- Split the sample-count policy: too few candidate samples stays a
  config error, but a short baseline (a legitimately partial base
  run) is reported and not gated.
- Print detected regressions even when a config issue forces exit 2,
  so a real violation is never hidden behind a configuration error.
- bench.yml now runs 'make bench-gate' on both sides; the Makefile's
  BENCH_GATE_PACKAGES/COUNT/TIME are the single source of truth, and
  each side benchmarks its own commit's package list, so growing the
  gate cannot empty the baseline. A failing base run degrades to its
  partial output plus a workflow warning instead of truncating the
  whole baseline into a vacuous pass. Run steps use set -euo
  pipefail and the workflow gains a Go-paths filter.
- Absorb the leading-edge O(history) signal recompute before the
  timer in BenchmarkSyncPathsIncrementalAppend (it was ~58% of the
  measured allocs, masking steady-state regressions), stretch the
  debounce window so the flush timer cannot fire nondeterministically
  inside a worst-run-gated loop, and pre-build appended lines.
- Build BenchmarkInsertMessagesBatch's 200-message fixture once
  outside the timed loop; only the SessionID is rewritten per
  iteration.
- Widen testDB to testing.TB, drop the bench-only openBenchDB copy,
  and fix BenchmarkGetDailyUsage's detached testing.T hack (its
  require failures and cleanups were wired to nothing).
- Split benchgate's compare()/main() into evalGate/compareUnits/
  render/parseFlags to meet function-size and complexity limits, and
  drop the dead NaN guard.
- Document the new mechanics in docs/internal/performance-gates.md
  and reformat it with mdformat --wrap 80.
@wesm

wesm commented Jul 3, 2026

Copy link
Copy Markdown
Member

I reviewed this branch and pushed 3109178 with fixes. Summary for reviewers, most severe first:

Bugs fixed

  1. The three sync benchmarks were silently absent from the gate. The engine's log.Printf output interleaves into go test -bench result lines on stdout, so benchfmt could not parse BenchmarkSyncAllWarmNoop, BenchmarkSyncPathsIncrementalAppend, or BenchmarkSyncAllColdArchive. Because the corruption was identical on both sides, they did not even appear as new/removed — the gate's headline benchmarks measured nothing. The benchmarks now silence the global logger (silenceBenchLogs), and benchgate treats unparseable result lines as a corrupted capture: reported with positions, exit 2. Without that second half, the same failure mode could return invisibly.

  2. Any base-side failure emptied the whole baseline, turning the gate into a vacuous pass. The worst trigger was the documented procedure for growing the gate: adding a package to BENCH_PACKAGES made the merge-base run exit 1 ("matched no packages"), truncating the baseline for every package on that very PR. The workflow now runs make bench-gate on both sides — the Makefile is the single source for the package list, sample count, and iteration count — and each side benchmarks its own commit's list, so growing the gate cannot break the base run. A failing base run keeps whatever partial output it produced (plus a ::warning) instead of truncating. To support partial baselines, the sample-count policy is now asymmetric: too few candidate samples is still a config error, but a short baseline is reported and not gated.

  3. Silent gate gaps became report lines. A gated unit missing from one side (e.g. a baseline captured without -benchmem) now reports "missing from baseline, not gated" instead of a bare continue; custom b.ReportMetric units report "has no gate, not gated"; and detected regressions print even when a config issue forces exit 2, so a real violation is never hidden behind a configuration error.

  4. Measurement quality of the append benchmark. The first append after a full sync triggers the leading-edge O(history) signal recompute inside the timed loop — roughly 58% of the measured allocs/op was that one-time cost, so a 1.25x steady-state per-append regression only moved the gated metric ~1.10x and passed. The benchmark now absorbs it before ResetTimer, stretches the debounce window so the 10s flush timer cannot fire nondeterministically inside a worst-run-gated loop, and pre-builds the appended lines. BenchmarkInsertMessagesBatch similarly builds its 200-message fixture once outside the timer (only the SessionID is rewritten per iteration, which allocates nothing).

Cleanups: testDB widened to testing.TB, which removes the bench-only openBenchDB copy and the pre-existing testDB(&testing.T{}) hack in BenchmarkGetDailyUsage (its require failures and cleanups were wired to a detached T that this workflow would have run 12x per PR); benchgate's compare()/main() split into evalGate/compareUnits/render/parseFlags; set -euo pipefail in both run steps; a Go-paths filter so docs/frontend-only PRs skip the job; docs updated to match.

Implications

  • On this PR itself, the merge base predates the bench-gate Make target, so the baseline step degrades (with a visible warning) and every benchmark reports as "new, no baseline". The gate is fully live from the first PR after this merges.
  • benchgate exiting 2 on corrupted captures means a future benchmark whose code logs to stdout fails its own PR loudly rather than silently un-gating itself. The docs now tell benchmark authors to silence loggers.

Verified end-to-end: the full gated suite parses all 8 benchmarks, a self-compare exits 0, an injected corrupted line exits 2, and the benchgate/db/sync suites pass.

@roborev-ci

roborev-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown

roborev: Combined Review (3109178)

Summary verdict: Medium issues remain; no Critical or High findings were reported.

Medium

  • cmd/benchgate/main.go:311
    A gated metric present in the baseline but missing from the candidate is only reported as “not gated” and does not create a config issue, so render can still exit 0. If -benchmem is dropped or the candidate capture loses B/op / allocs/op, CI can silently disable deterministic gates despite the documented exit-2 policy.
    Fix: Treat missing candidate gated units as configuration issues so the gate exits 2, while keeping intentionally partial-baseline behavior explicit.

  • internal/sync/engine_bench_test.go:296
    BenchmarkSyncAllColdArchive claims to measure bulk-write ingest but calls SyncAll, which uses syncWriteDefault; the actual writeBatchBulk / DB.WriteSessionBatch path used by resync is not exercised, leaving the documented bulk-ingest regression class ungated.
    Fix: Add or change a benchmark to drive syncAllLocked(..., syncWriteBulk, ...), resync, or DB.WriteSessionBatch directly.


Reviewers: 2 done | Synthesis: codex, 9s | Total: 9m25s

…he bulk-write path

Addresses the roborev findings on 3109178:

- A gated unit present in the baseline but missing from the
  candidate (e.g. -benchmem dropped) is now a configuration error
  (exit 2) instead of a report-only line, so a PR cannot silently
  disable the deterministic gates for itself and every PR after it.
  A unit missing from the baseline stays reported-and-not-gated,
  since a partial or older base run is legitimate.
- bench.yml evaluates 'make bench-gate-config' on the PR head and
  passes BENCH_GATE_COUNT/TIME into the merge-base invocation, so a
  PR that changes those defaults still compares identical workloads
  (two benchmarks grow their fixture per iteration). The package
  list intentionally stays per-side.
- Add BenchmarkResyncBulkIngest: SyncAll uses the default
  per-session write path, so the cold-archive benchmark never
  exercised writeBatchBulk/DB.WriteSessionBatch — the #411 bulk
  ingest class was ungated. The new benchmark drives syncAllLocked
  with syncWriteBulk and self-asserts every session went through the
  batch pipeline; the cold-archive benchmark is documented as
  covering the default first-sync write path. Both share one
  benchColdArchive loop.
- Update docs/internal/performance-gates.md for the unit-gap
  policy, the pinned gate config, and the new benchmark.
@roborev-ci

roborev-ci Bot commented Jul 4, 2026

Copy link
Copy Markdown

roborev: Combined Review (6342b2b)

No issues found.


Reviewers: 2 done | Synthesis: codex | Total: 7m41s

@wesm wesm merged commit eaa34a2 into main Jul 4, 2026
24 checks passed
@wesm wesm deleted the t3code/performance-benchmark-gates branch July 4, 2026 10:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants