Support Circut break #17

Open: mwfj wants to merge 29 commits into main from support-circut-break
@mwfj mwfj commented Apr 13, 2026

Summary

Adds per-upstream circuit breaking to the gateway, preventing cascading failures when a backend becomes unhealthy. Tracks upstream failures on a resilience4j-style three-state machine (CLOSED → OPEN → HALF_OPEN → CLOSED), trips on either consecutive-failure or failure-rate thresholds, and short-circuits checkouts with 503 Service Unavailable while the circuit is OPEN. A separate retry budget caps the fraction of concurrent upstream work that may be retries, bounding the retry-storm amplification factor even when individual retries pass the breaker gate.


What's in this PR

Config schema

  • CircuitBreakerConfig struct in include/config/server_config.h (12 fields: enabled, dry_run, thresholds, window, half-open budget, open-duration bounds, retry-budget tuning).
  • Nested into UpstreamConfig. The UpstreamConfig::operator== equality operator EXCLUDES circuit_breaker because those fields are live-reloadable — topology fields (name, host, port, tls, pool, proxy) remain restart-only.
  • JSON parse with strict per-field type validation (is_number_integer / is_boolean — rejects 1.9 → 1 and true → 1 silent coercions).
  • 13 validation rules in ConfigLoader::Validate, one invalid_argument per rule, upper bounds on consecutive_failure_threshold (≤10k), minimum_volume (≤10M), permitted_half_open_calls (≤1k).
  • Round-trip ToJson serialization.
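
The range rules above can be pictured with a small sketch. It is not the code in ConfigLoader::Validate: the struct is a trimmed, hypothetical stand-in, the default values are invented, and the lower bound of 1 is an assumption. Only the three upper bounds come from this description.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical, trimmed stand-in for CircuitBreakerConfig. Only fields named
// in this description appear here, and the defaults are invented.
struct BreakerConfigSketch {
    bool enabled = false;
    bool dry_run = false;
    int consecutive_failure_threshold = 5;
    int minimum_volume = 20;
    int permitted_half_open_calls = 3;
};

// One std::invalid_argument per violated rule, naming the offending field.
inline void CheckRange(const std::string& field, long long value,
                       long long lo, long long hi) {
    if (value < lo || value > hi) {
        throw std::invalid_argument("circuit_breaker." + field + " must be in [" +
                                    std::to_string(lo) + ", " +
                                    std::to_string(hi) + "]");
    }
}

// Upper bounds follow the description (10k, 10M, 1k); the lower bound of 1
// is an assumption made for this sketch.
inline void ValidateSketch(const BreakerConfigSketch& c) {
    CheckRange("consecutive_failure_threshold", c.consecutive_failure_threshold, 1, 10000);
    CheckRange("minimum_volume", c.minimum_volume, 1, 10000000);
    CheckRange("permitted_half_open_calls", c.permitted_half_open_calls, 1, 1000);
}
```

Throwing on the first violated rule keeps each error message tied to exactly one field, which matches the "one invalid_argument per rule" shape described above.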

Core state machine + sliding window

include/circuit_breaker/:

  • circuit_breaker_state.h: State, Decision { ADMITTED, ADMITTED_PROBE, REJECTED_OPEN, REJECTED_OPEN_DRYRUN }, FailureKind, StateTransitionCallback.
  • circuit_breaker_window.h/.cc — time-bucketed sliding window (ring of per-second buckets, lazy advance, dispatcher-thread-local, no locks). Constructor clamps non-positive window_seconds to 1.
  • circuit_breaker_slice.h/.cc — per-dispatcher breaker slice with:
    • Dual trip paths (consecutive-failure OR rate-with-min-volume).
    • Lazy OPEN → HALF_OPEN on next TryAcquire.
    • Exponential open duration (base << consecutive_trips, capped at max); ComputeOpenDuration clamps non-positive / inverted bounds at use.
    • Bounded HALF_OPEN probes via half_open_admitted_ (monotone per-cycle counter, not inflight — prevents slot-reuse after one probe completes).
    • Snapshot of permitted_half_open_calls at cycle entry so a mid-cycle reload can't change the budget for the running cycle.
    • Dry-run mode (returns REJECTED_OPEN_DRYRUN; caller proceeds).
    • Generation tokens split by admission domain (closed_gen_ / halfopen_gen_) — stale reports drop silently; window-resize bump doesn't strand in-flight probes.
    • TryAcquire() returns Admission { Decision, uint64_t generation }; Report{Success,Failure,Neutral} takes the admission generation.
    • ReportNeutral — slot-release path for admissions that terminate locally (POOL_EXHAUSTED, shutdown, client disconnect) without counting as success or failure.
    • Config hot-reload preserving live state on threshold-only edits; full reset on enabled toggle; window-resize also resets consecutive_failures_.
    • Disabled fast path: single if (!config_.enabled) return ADMITTED; early return, zero atomic traffic when off.
    • Time source injection for deterministic tests.
    • Public accessors: IsOpenDeadlineSet(), config(), NextOpenDurationMs().
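
As a rough illustration of the time-bucketed window and the rate trip path, here is a minimal single-threaded sketch. The class name, method names, and the percent-based threshold parameter are all hypothetical; the real circuit_breaker_window may be structured differently. It shows a ring of per-second buckets, lazy advance, the clamp of non-positive window_seconds to 1, and a rate check gated on a minimum volume.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative ring of per-second buckets. Single-threaded by design,
// mirroring the "dispatcher-thread-local, no locks" note above.
class SlidingWindowSketch {
public:
    explicit SlidingWindowSketch(int window_seconds)
        : buckets_(static_cast<std::size_t>(window_seconds > 0 ? window_seconds : 1)) {}

    void Record(int64_t now_sec, bool failure) {
        Advance(now_sec);
        Bucket& b = buckets_[Index(now_sec)];
        ++b.total;
        if (failure) ++b.failures;
    }

    // Rate trip path: only meaningful once the window holds enough samples.
    bool RateTripped(int64_t now_sec, uint32_t min_volume, uint32_t fail_percent) {
        Advance(now_sec);
        uint64_t total = 0, failures = 0;
        for (const Bucket& b : buckets_) {
            total += b.total;
            failures += b.failures;
        }
        return total >= min_volume && failures * 100 >= total * fail_percent;
    }

private:
    struct Bucket {
        uint32_t total = 0;
        uint32_t failures = 0;
    };

    // Always lands inside the ring, even for a negative clock value.
    std::size_t Index(int64_t sec) const {
        const int64_t n = static_cast<int64_t>(buckets_.size());
        return static_cast<std::size_t>(((sec % n) + n) % n);
    }

    // Lazy advance: clear the buckets for seconds that elapsed since the
    // previous call, touching at most one full lap of the ring.
    void Advance(int64_t now_sec) {
        if (!started_) {
            started_ = true;
            last_sec_ = now_sec;
            return;
        }
        if (now_sec <= last_sec_) return;  // same second, or clock went backwards
        const int64_t n = static_cast<int64_t>(buckets_.size());
        const int64_t elapsed = now_sec - last_sec_;
        const int64_t steps = elapsed < n ? elapsed : n;
        for (int64_t i = 1; i <= steps; ++i) buckets_[Index(last_sec_ + i)] = Bucket{};
        last_sec_ = now_sec;
    }

    std::vector<Bucket> buckets_;
    int64_t last_sec_ = 0;
    bool started_ = false;
};
```

Passing the clock in as an argument is one simple way to get the deterministic tests that time-source injection is meant to enable.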

Host / Manager / RetryBudget

  • retry_budget.h/.cc: RetryBudget class. RAII InFlightGuard for per-attempt bookkeeping. CAS loop TryConsumeRetry (concurrent retries cannot race past the cap) with a non-retry denominator (cap = max(min_conc, (in_flight - retries_in_flight) * percent / 100)), so in steady state the effective retry fraction matches the configured percent rather than drifting above it. ComputeCap() observability accessor.
  • circuit_breaker_host.h/.cc: CircuitBreakerHost owns N slices (one per dispatcher partition) + one shared RetryBudget. Snapshot() aggregates per-slice counters + retry-budget state. Reload() fans out per-slice Slice::Reload calls via Dispatcher::EnQueue. host_label format: service=<svc> host=<h>:<p> partition=<i>.
  • circuit_breaker_manager.h/.cc: CircuitBreakerManager keyed by service name. Topology stable post-construction (lock-free GetHost). Constructor validates dispatcher-count vs config partition-count mismatch (throws; skipped when dispatchers is empty for unit-test paths). Reload() serialized by mutex.
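
The retry-budget cap formula and the CAS loop can be sketched as follows. This is a hedged illustration rather than the RetryBudget class itself: the class name, OnAttemptStart/OnAttemptEnd, ReleaseRetry, and the memory orderings are assumptions, and the RAII InFlightGuard is replaced by explicit calls to keep the example short. The cap expression is the one quoted above.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for RetryBudget.
class RetryBudgetSketch {
public:
    RetryBudgetSketch(int64_t percent, int64_t min_concurrency)
        : percent_(percent), min_concurrency_(min_concurrency) {}

    // Every attempt (first or retry) holds one in-flight unit.
    void OnAttemptStart() { in_flight_.fetch_add(1, std::memory_order_relaxed); }
    void OnAttemptEnd() { in_flight_.fetch_sub(1, std::memory_order_relaxed); }

    // cap = max(min_conc, (in_flight - retries_in_flight) * percent / 100)
    int64_t ComputeCap() const {
        return CapFor(in_flight_.load(std::memory_order_relaxed),
                      retries_in_flight_.load(std::memory_order_relaxed));
    }

    // CAS loop: the increment only lands if retries_in_flight_ still holds
    // the value the cap check was made against, so two racing callers
    // cannot both slip past the cap.
    bool TryConsumeRetry() {
        int64_t cur = retries_in_flight_.load(std::memory_order_relaxed);
        for (;;) {
            const int64_t cap = CapFor(in_flight_.load(std::memory_order_relaxed), cur);
            if (cur >= cap) return false;
            if (retries_in_flight_.compare_exchange_weak(
                    cur, cur + 1, std::memory_order_acq_rel,
                    std::memory_order_relaxed)) {
                return true;
            }
            // A failed CAS refreshed `cur`; loop and re-check against the cap.
        }
    }

    void ReleaseRetry() { retries_in_flight_.fetch_sub(1, std::memory_order_acq_rel); }

private:
    int64_t CapFor(int64_t in_flight, int64_t retries) const {
        const int64_t base = std::max<int64_t>(0, in_flight - retries);
        return std::max<int64_t>(min_concurrency_, base * percent_ / 100);
    }

    const int64_t percent_;
    const int64_t min_concurrency_;
    std::atomic<int64_t> in_flight_{0};
    std::atomic<int64_t> retries_in_flight_{0};
};
```

With percent = 20 and 20 attempts in flight, this sketch admits three retries: the fourth would see a non-retry base of 17 and a cap of 3. Subtracting retries from the denominator is what stops retries from raising their own ceiling.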

Hot-path integration — ProxyTransaction + UpstreamManager + HttpServer

Ownership & wiring:

  • HttpServer::circuit_breaker_manager_ — declared AFTER upstream_manager_ so destruction runs breaker-first.
  • UpstreamManager::AttachCircuitBreakerManager(raw*) — atomic non-owning pointer (release/acquire).
  • HttpServer::MarkServerReady installs a per-slice transition callback capturing (service, dispatcher_index). Fires only on CLOSED→OPEN. Wired for ALL upstreams regardless of enabled so live reload from enabled=false→true works without re-wiring.
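
The declaration-order point relies on a C++ language rule: non-static data members are destroyed in reverse order of declaration. The sketch below demonstrates that rule together with a release/acquire non-owning pointer. All class names are hypothetical, and holding the managers by value is an assumption made to keep the example small.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

class BreakerManagerSketch {
public:
    explicit BreakerManagerSketch(std::vector<std::string>* log) : log_(log) {}
    ~BreakerManagerSketch() { log_->push_back("breaker"); }

private:
    std::vector<std::string>* log_;
};

class UpstreamManagerSketch {
public:
    explicit UpstreamManagerSketch(std::vector<std::string>* log) : log_(log) {}
    ~UpstreamManagerSketch() { log_->push_back("upstream"); }

    // Non-owning pointer published with release and read with acquire, so a
    // reader that sees the pointer also sees the fully constructed manager.
    void AttachBreaker(BreakerManagerSketch* m) {
        breaker_.store(m, std::memory_order_release);
    }
    BreakerManagerSketch* breaker() const {
        return breaker_.load(std::memory_order_acquire);
    }

private:
    std::vector<std::string>* log_;
    std::atomic<BreakerManagerSketch*> breaker_{nullptr};
};

class ServerSketch {
public:
    explicit ServerSketch(std::vector<std::string>* log)
        : upstream_manager_(log), circuit_breaker_manager_(log) {
        upstream_manager_.AttachBreaker(&circuit_breaker_manager_);
    }
    bool attached() const {
        return upstream_manager_.breaker() == &circuit_breaker_manager_;
    }

private:
    UpstreamManagerSketch upstream_manager_;        // declared first, destroyed last
    BreakerManagerSketch circuit_breaker_manager_;  // declared after, destroyed first
};
```

Destroying a ServerSketch appends "breaker" and then "upstream" to the log, which is the breaker-first order described above.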

Result codes:

  • PoolPartition::CHECKOUT_CIRCUIT_OPEN = -6 (delivered to wait-queue waiters drained on a breaker trip).
  • ProxyTransaction::RESULT_CIRCUIT_OPEN = -7, RESULT_RETRY_BUDGET_EXHAUSTED = -8.

ProxyTransaction hot-path changes:

  • slice_ + retry_budget_ resolved once at Start().
  • AttemptCheckout calls ConsultBreaker() at the top; each attempt (first + retries) gets a fresh admission stamped with the slice's current generation.
  • inflight_guard_ (RAII) replaced on every AttemptCheckout — stays at exactly one in_flight unit per transaction.
  • MaybeRetry calls TryConsumeRetry before committing to the retry; exhausted → DeliverResponse(MakeRetryBudgetResponse()) (terminal, not reported).
  • ReportBreakerOutcome(result_code) classifies per design §7 and fires BEFORE MaybeRetry at every failure site so the retry's fresh ConsultBreaker sees the latest count.
  • ReleaseBreakerAdmissionNeutral() in Cancel(): a client disconnect is always neutral (a replacement probe slot is acceptable; tripping a healthy backend on user-side abandonment would be a DoS vector).

Response factories:

  • MakeCircuitOpenResponse() — state-aware Retry-After: OPEN reads stored slice->OpenUntil(); HALF_OPEN uses slice->NextOpenDurationMs() (exponential-backoff aware). Ceil division. Absolute cap 3600s. X-Circuit-Breaker: open|half_open label (distinguishes the two reject paths).
  • MakeRetryBudgetResponse() — 503 + X-Retry-Budget-Exhausted: 1 + Connection: close. No Retry-After (budget has no recovery clock).
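
The Retry-After arithmetic can be sketched in two small functions. Both names are hypothetical. The doubling loop stands in for base << consecutive_trips, chosen here so that a large trip count cannot overflow, and the floor of 1 second for a non-positive remainder is an assumption. The ceil division, the clamps for non-positive or inverted bounds, and the 3600s cap come from this description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Exponential open duration: base doubled once per consecutive trip and
// capped at max. Degenerate inputs are clamped at use.
inline int64_t OpenDurationMsSketch(int64_t base_ms, int64_t max_ms,
                                    int consecutive_trips) {
    if (base_ms <= 0) base_ms = 1;           // clamp non-positive base
    if (max_ms < base_ms) max_ms = base_ms;  // clamp inverted bounds
    int shift = std::max(0, consecutive_trips);
    int64_t d = base_ms;
    // Stop doubling once the cap is reached, so the value never overflows.
    while (shift-- > 0 && d < max_ms) {
        d = (d > max_ms / 2) ? max_ms : d * 2;
    }
    return std::min(d, max_ms);
}

// Retry-After in whole seconds: ceil division with an absolute cap.
inline int64_t RetryAfterSecondsSketch(int64_t remaining_ms) {
    constexpr int64_t kCapSeconds = 3600;
    if (remaining_ms <= 0) return 1;  // assumption: never advertise 0 seconds
    const int64_t secs = remaining_ms / 1000 + (remaining_ms % 1000 != 0 ? 1 : 0);
    return std::min(secs, kCapSeconds);
}
```

For example, 5500 ms remaining yields Retry-After: 6, where truncating division would have told the client to come back half a second too early.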

Wait-queue drain on trip

  • PoolPartition::DrainWaitQueueOnTrip() — dispatcher-thread. Iterates wait_queue_, pops each entry, fires error_callback(CHECKOUT_CIRCUIT_OPEN) on non-cancelled waiters. Skips if shutting_down_ (InitiateShutdown is already draining with CHECKOUT_SHUTTING_DOWN). Hoists alive_ against teardown re-entry. Does NOT set shutting_down_ — this is a transient drain; the partition keeps its connections for HALF_OPEN probing.
  • UpstreamManager::GetPoolPartition(service, index) accessor.
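
A minimal sketch of the drain loop, assuming a simplified waiter type. The types and names are hypothetical, the alive_ teardown guard is omitted, and the return value is added only so the behavior is easy to observe. The result code, the skip during shutdown, and the fact that the drain leaves shutting_down_ untouched follow the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

constexpr int kCheckoutCircuitOpen = -6;  // mirrors CHECKOUT_CIRCUIT_OPEN above

struct WaiterSketch {
    bool cancelled = false;
    std::function<void(int)> error_callback;
};

class PoolPartitionSketch {
public:
    void Enqueue(WaiterSketch w) { wait_queue_.push_back(std::move(w)); }
    void InitiateShutdown() { shutting_down_ = true; }
    std::size_t queued() const { return wait_queue_.size(); }

    // Returns the number of waiters that were notified.
    int DrainWaitQueueOnTrip() {
        if (shutting_down_) return 0;  // the shutdown path owns the queue
        int notified = 0;
        while (!wait_queue_.empty()) {
            WaiterSketch w = std::move(wait_queue_.front());
            wait_queue_.pop_front();  // pop first, then fire the callback
            if (w.cancelled || !w.error_callback) continue;
            w.error_callback(kCheckoutCircuitOpen);
            ++notified;
        }
        return notified;  // shutting_down_ stays false: this drain is transient
    }

private:
    std::deque<WaiterSketch> wait_queue_;
    bool shutting_down_ = false;
};
```

Popping each entry before invoking its callback keeps the queue consistent if the callback touches the partition again.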

Observability

All events surface through structured logs + a snapshot API. Full log catalog is in docs/circuit_breaker.md §Observability; highlights:

  • CLOSED → OPEN trip at warn: trigger, consecutive_failures, window_total, window_fail_rate, open_for_ms, consecutive_trips (captured pre-reset so operators can distinguish a consecutive trip from a rate trip).
  • OPEN → HALF_OPEN / HALF_OPEN → CLOSED / HALF_OPEN → OPEN at appropriate levels.
  • Reject logs — first of cycle at info for the breadcrumb, subsequent at debug. Dry-run rejects at info with [dry-run] prefix.
  • retry budget exhausted at warn: service, in_flight, retries_in_flight, cap (via new RetryBudget::ComputeCap() accessor).
  • circuit breaker config applied at info on every reload.
  • PoolPartition draining wait queue on breaker trip at info with queue_size.

CircuitBreakerManager::SnapshotAll() returns per-host rows with per-slice counters (state, trips, rejected, probe_successes, probe_failures, RejectedHalfOpenFull, ReportsStaleGeneration) and host aggregates (total_trips, total_rejected, open_partitions, half_open_partitions, retries_in_flight, retries_rejected, in_flight). A future /admin/breakers endpoint would JSON-serialize this.

Hot-reload

  • HttpServer::Reload invokes circuit_breaker_manager_->Reload(new_config.upstreams) unconditionally — idempotent when no CB fields changed (atomic stores).
  • Per-slice Slice::Reload is enqueued on the owning dispatcher so config mutations happen on the correct thread.
  • Live state preserved on threshold-only edits; silent full reset on enabled toggle.
  • Topology warn rephrased to disambiguate: "upstream topology changes require a restart to take effect (circuit-breaker field edits, if any, were applied live)".
  • upstream_configs_ baseline persisted post-reload so subsequent reloads diff against the latest state.
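
The three reload cases (threshold-only edit, window resize, enabled toggle) can be sketched as below. This is an illustration under assumptions: the struct layout and field names are hypothetical, and exactly which fields a "full reset" clears, and whether it bumps both generations, is a guess. The preserved state on threshold-only edits, the consecutive-failure reset on window resize, and the closed-domain-only generation bump follow the description in this PR.

```cpp
#include <cassert>
#include <cstdint>

enum class StateSketch { CLOSED, OPEN, HALF_OPEN };

struct ReloadConfigSketch {
    bool enabled = true;
    int consecutive_failure_threshold = 5;
    int window_seconds = 10;
};

// Hypothetical slice with public fields so the effects are easy to inspect.
struct SliceReloadSketch {
    ReloadConfigSketch config;
    StateSketch state = StateSketch::CLOSED;
    int consecutive_failures = 0;
    uint64_t closed_gen = 1;
    uint64_t halfopen_gen = 1;

    void Reload(const ReloadConfigSketch& next) {
        const bool toggled = next.enabled != config.enabled;
        const bool resized = next.window_seconds != config.window_seconds;
        config = next;
        if (toggled) {
            // Full reset. Bumping both generations is an assumption.
            state = StateSketch::CLOSED;
            consecutive_failures = 0;
            ++closed_gen;
            ++halfopen_gen;
        } else if (resized) {
            // Old samples are gone, so the streak counter goes with them.
            // Only the closed domain is bumped, which leaves in-flight
            // HALF_OPEN probes valid.
            consecutive_failures = 0;
            ++closed_gen;
        }
        // Threshold-only edit: live state is preserved as-is.
    }

    // A report stamped with an older generation is dropped silently.
    bool ReportFailure(uint64_t admission_gen) {
        if (admission_gen != closed_gen) return false;
        ++consecutive_failures;
        return true;
    }
};
```

Splitting the generation counter by admission domain is what lets a window resize invalidate CLOSED-state reports without stranding a probe that was admitted in HALF_OPEN.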

Development review history

The feature was built iteratively. Major review-caught regressions are captured as pitfall entries in development rules and regression tests. Highlights:

Core state-machine review rounds — pre-increment shift, Report* state guard, HALF_OPEN saw_failure short-circuit, Reload-across-enabled-toggle reset, saw_failure counter misclassification, generation token for stale-report drop, OpenUntil() cleared in HALF_OPEN, window-resize generation bump, domain-split generation (closed_gen_ / halfopen_gen_), orphaned consecutive_failures_ reset, probe budget snapshot at cycle entry, << 0 crash clamp, JSON strict type accessors, ComputeOpenDuration clamps, half_open_admitted_ monotone counter, ReportNeutral, main.cc reload config save/restore.

Hot-path integration review rounds:

| Round | Finding | Fix |
|---|---|---|
| R1 | RetryBudget::TryConsumeRetry raced concurrent retries past the cap | CAS loop |
| R1 | Cap denominator used raw in_flight, letting retries inflate their own ceiling | Non-retry base: subtract retries_in_flight |
| R1 | Retry-After used truncating division (5500ms → 5s) | Ceil division |
| R1 | CircuitBreakerManager ctor accepted dispatcher-count mismatch silently | Explicit check + throw |
| R1 | host_label format drifted | Aligned to service=X host=Y:Z partition=N |
| R2 | OnError paths missed ReportBreakerOutcome before MaybeRetry for timeout / disconnect | Added at stale-keep-alive, upstream-disconnect, response-timeout sites |
| R2 | Cancel() didn't release the admission; probe slot stranded on mid-probe abort | ReleaseBreakerAdmissionNeutral() in Cancel() |
| R3 | MakeCircuitOpenResponse produced Retry-After: 0 in HALF_OPEN | State-branched: HALF_OPEN uses NextOpenDurationMs() |
| R4 | Round 3's signal-preserving cancel let client aborts trip a healthy backend (DoS) | Cancel() always neutral |
| R4 | HALF_OPEN Retry-After reuses base only; exp-backoff invisible | NextOpenDurationMs() |
| R4 | Public getter OpenUntil() contract conflict across states | Added IsOpenDeadlineSet(), config(), NextOpenDurationMs() |
| R5 | worker_threads=2 flaky for sharding | Single worker across integration tests |
| R5 | /echo/toggle route 404; backend only registered /fail | Route aligned + trips == 0 assertion |
| R5 | TestHalfOpenRetryAfterScalesWithBackoff had recovery cycles resetting consecutive_trips_ | Rewrote to drive trips via probe failures only |

Each test flagged at R5 was re-verified by injecting the described regression — the test failed, confirming the guard works.


Tests (+105 total, 365 → 470)

Config (test/config_test.h)

Defaults, JSON parse, partial block, round-trip, equality (CB excluded from UpstreamConfig::operator==), 13 validation cases, 3 type-strictness cases.

Circuit-breaker test suites (test/circuit_breaker_*_test.h)

| File | Scope | Count |
|---|---|---|
| circuit_breaker_test.h | State machine + window unit tests (generation tokens, reload variants, clamp regressions, neutral-release, transition callback). | 45 |
| circuit_breaker_components_test.h | RetryBudget + CircuitBreakerHost + CircuitBreakerManager component unit tests. | 11 |
| circuit_breaker_integration_test.h | End-to-end through HttpServer: bare proxy · consecutive-5xx trip · disabled passthrough · 2xx success resets · trip drives slice state · OPEN short-circuits upstream · Retry-After value · circuit-open terminal for retry · dry-run passthrough · HALF_OPEN recovery round-trip · Retry-After ceil · retried failures count toward trip · HALF_OPEN reject label · HALF_OPEN Retry-After exponential-aware. | 14 |
| circuit_breaker_retry_budget_test.h | Budget rejects retry · min-concurrency floor admits retries · dry-run passthrough · first attempts not gated. | 4 |
| circuit_breaker_wait_queue_drain_test.h | Queue drained on trip (B sees 503, backend_hits==1) · disabled breaker doesn't drain (backend_hits==2). | 2 |
| circuit_breaker_observability_test.h | Snapshot reflects counters · trip log field presence (via ringbuffer sink) · retry-budget observability (log fields + retries_rejected >= 1). | 3 |
| circuit_breaker_reload_test.h | Reload propagates to live slice · CB-only reload emits no topology warn · topology change still warns · disable→enable cycle. | 4 |

Build system

  • Makefile: CIRCUIT_BREAKER_SRCS = 5 .cc files; CIRCUIT_BREAKER_HEADERS = 6 .h files; TEST_HEADERS includes all 6 circuit_breaker*_test.h files plus the core unit suite.
  • ./test_runner circuit_breaker (or -B) runs every circuit-breaker suite.

Documentation

  • Public user guide: docs/circuit_breaker.md — configuration fields, client-facing responses, hot-reload semantics, observability, short design notes.

Test plan

  • make clean && make -j4 produces a clean build.
  • ./test_runner passes all 470 tests.
  • ./test_runner circuit_breaker (or -B) runs the circuit-breaker suites in isolation.
  • With circuit_breaker.enabled=false (the default), there is no behavioral change to production traffic — hot path is a single branch read against a nullptr slice or the disabled fast path.
  • SIGHUP with a pure CB-field edit applies live and emits "circuit breaker config applied" without a restart warn. Topology edits emit "upstream topology changes require a restart to take effect (circuit-breaker field edits, if any, were applied live)" — the CB portion of such a reload is still applied live.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements a per-dispatcher circuit breaker for upstream hosts, featuring a sliding window for failure tracking and support for exponential backoff during recovery. The implementation includes configuration parsing, validation, and comprehensive unit tests. Review feedback focuses on preventing out-of-bounds access in the sliding window indexing, optimizing the HALF_OPEN state to halt probes after a failure is detected, and adjusting log levels for dry-run rejections to avoid log flooding.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49a2ae9ce9

@mwfj mwfj changed the title from "Support Circut break Phase1-2" to "Support Circut break" on Apr 14, 2026