
feat(harness): lb scenario tier — simulate tier/failover/breaker/affinity #1224

Open
0x0079 wants to merge 1 commit into main from claude/lb-scenario-harness


0x0079 (Collaborator) commented Jun 16, 2026

Follow-up to #1223, now rebased onto main (the core affinity/tier fix merged in #1223). The diff is harness-only.

What

Our tests could not realistically verify load-balancing dynamics. This adds a scenario simulator and a harness lb CLI tier so tier selection, mid-request failover, the circuit breaker (trip + timed recovery), health-based exclusion, and session-affinity pin movement can be driven against programmable fake upstreams over a request sequence — and watched request by request.

  • internal/server/lbsim.go — shared engine LBSimulator, driving the real ServiceSelector.Select → dispatchWithPriorityFailover path with a deterministic clock. Each attempt feeds both production feedback channels exactly as a real request would: the breaker recorder and Server.reportHealthStatus (status-classified: 429 → rate-limit, 401/403 → immediate auth-unhealthy, 5xx/other → 3-strike).
  • Shared clock seam: SetClock/nowFn in loadbalance (breaker.go + health_monitor.go) and routing (affinity TTL). One fake clock drives breaker recovery, health recovery, and strict affinity-TTL expiry together. Production is unchanged (defaults to time.Now).
  • cli/harness lb: --file scenario.yaml or --example cascade|flat|grid|single|regression|ratelimit|authflip|crossmodel. Default output is a pencil graph (per-request failover hops + each svc's breaker/health + affinity pin); --table and --json also available.

Alignment with the merged load-balance fixes

Rebasing surfaced two behaviours that landed alongside #1223/#1233; the simulator now models both:

Example

#3  s1   t0/gpt-4 ✗500  →  t1/gpt-4 ✓200   →  client=200
       state: t0/gpt-4=open/unhealthy   t1/gpt-4=closed/healthy   pin=t0/gpt-4
#4  s1   t1/gpt-4 ✓200   →  client=200
       state: t0/gpt-4=open/unhealthy   t1/gpt-4=closed/healthy   pin=t1/gpt-4

Tests

internal/server/lb_scenario_test.go — A/B/C/D shapes + the original regression + rate-limit/auth-error + cross-model + strict-TTL scenarios, all asserting the captured trace, affinity pin, and breaker/health snapshots. go build ./..., go vet, and the loadbalance / typ / routing / server / harness packages are green.
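The status classification that the rate-limit and auth-error scenarios exercise (429 → rate-limit, 401/403 → immediate auth-unhealthy, 5xx/other → 3-strike) can be sketched roughly as below. This is an illustration only, not the body of `Server.reportHealthStatus`: the `outcome` and `health` types, the treatment of 2xx as success, and the choice to reset the strike counter on success are all assumptions.

```go
package main

import "fmt"

// outcome is a hypothetical classification of one upstream response.
type outcome int

const (
	success outcome = iota
	rateLimited
	authError
	strike
)

// classify maps an HTTP status to a health outcome, following the
// mapping stated in the PR description. Treating 2xx as success is
// an assumption.
func classify(status int) outcome {
	switch {
	case status == 429:
		return rateLimited
	case status == 401 || status == 403:
		return authError
	case status >= 200 && status < 300:
		return success
	default:
		return strike // 5xx and anything else counts toward 3-strike
	}
}

const strikeLimit = 3

// health is a hypothetical per-service health record.
type health struct {
	strikes   int
	unhealthy bool
	limited   bool
}

// report applies one response status to the record.
func (h *health) report(status int) {
	switch classify(status) {
	case success:
		h.strikes = 0 // assumption: a success resets the strike count
	case rateLimited:
		h.limited = true
	case authError:
		h.unhealthy = true // immediate, no strike counting
	case strike:
		h.strikes++
		if h.strikes >= strikeLimit {
			h.unhealthy = true
		}
	}
}

func main() {
	h := &health{}
	for _, s := range []int{500, 500, 200, 500, 500, 500} {
		h.report(s)
		fmt.Printf("%d -> strikes=%d unhealthy=%t limited=%t\n",
			s, h.strikes, h.unhealthy, h.limited)
	}
}
```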

Known gap (deferred)

G1 — horizontal tactics (random/token/…) are breaker-blind at the selection layer; documented in priority-routing.md and marked as an executable t.Skip in the harness.

🤖 Generated with Claude Code

https://claude.ai/code/session_01MCtGUNwURzSk34PQ8gkjZC

0x0079 force-pushed the claude/lb-scenario-harness branch 5 times, most recently from 6e72508 to be55eb7 (June 17, 2026 13:51)
0x0079 changed the base branch from claude/tier-affinity-fix to main (June 17, 2026 13:51)
FFengIll force-pushed the claude/lb-scenario-harness branch from be55eb7 to 245767b (June 18, 2026 04:51)
…nity

Follow-up to the affinity/tier fix. Adds a load-balancing scenario simulator
and a `harness lb` CLI tier on top of it, so routing dynamics (tier
selection, mid-request failover, breaker trip + timed recovery, health-based
exclusion, affinity pin movement) can be driven against programmable fake
upstreams over a request sequence and watched request-by-request.

- internal/server/lbsim.go: shared engine (LBSimulator) driving the real
  ServiceSelector.Select -> dispatchWithPriorityFailover path with a
  deterministic breaker clock; feeds both production feedback channels
  (breaker recorder + Server.reportHealthStatus) per status, faithfully.
- loadbalance: SetClock/nowFn clock seam (breaker.go + health_monitor.go,
  same package) so one sim clock advance recovers both channels; production
  unchanged (defaults to time.Now).
- cli/harness lb: YAML/`--example` scenarios; default pencil-graph output
  (per-request hops + svc breaker/health/pin), `--table`, `--json`.
- .design/priority-routing.pencil.md: 500-retry worked-example graph; README
  + the two-feedback-channels note.
@FFengIll FFengIll force-pushed the claude/lb-scenario-harness branch from 245767b to 6ff2ba1 Compare June 18, 2026 04:52