Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a per-dispatcher circuit breaker for upstream hosts, featuring a sliding window for failure tracking and support for exponential backoff during recovery. The implementation includes configuration parsing, validation, and comprehensive unit tests. Review feedback focuses on preventing out-of-bounds access in the sliding window indexing, optimizing the HALF_OPEN state to halt probes after a failure is detected, and adjusting log levels for dry-run rejections to avoid log flooding.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 49a2ae9ce9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Summary
Adds per-upstream circuit breaking to the gateway, preventing cascading failures when a backend becomes unhealthy. Tracks upstream failures on a resilience4j-style three-state machine (CLOSED → OPEN → HALF_OPEN → CLOSED), trips on either consecutive-failure or failure-rate thresholds, and short-circuits checkouts with
503 Service Unavailablewhile the circuit is OPEN. A separate retry budget caps the fraction of concurrent upstream work that may be retries, bounding the retry-storm amplification factor even when individual retries pass the breaker gate.What's in this PR
Config schema
CircuitBreakerConfigstruct ininclude/config/server_config.h(12 fields:enabled,dry_run, thresholds, window, half-open budget, open-duration bounds, retry-budget tuning).UpstreamConfig. TheUpstreamConfig::operator==equality operator EXCLUDEScircuit_breakerbecause those fields are live-reloadable — topology fields (name, host, port, tls, pool, proxy) remain restart-only.is_number_integer/is_boolean— rejects1.9 → 1andtrue → 1silent coercions).ConfigLoader::Validate, oneinvalid_argumentper rule, upper bounds onconsecutive_failure_threshold(≤10k),minimum_volume(≤10M),permitted_half_open_calls(≤1k).ToJsonserialization.Core state machine + sliding window
include/circuit_breaker/:circuit_breaker_state.h—State,Decision { ADMITTED, ADMITTED_PROBE, REJECTED_OPEN, REJECTED_OPEN_DRYRUN },FailureKind,StateTransitionCallback.circuit_breaker_window.h/.cc— time-bucketed sliding window (ring of per-second buckets, lazy advance, dispatcher-thread-local, no locks). Constructor clamps non-positivewindow_secondsto 1.circuit_breaker_slice.h/.cc— per-dispatcher breaker slice with:TryAcquire.base << consecutive_trips, capped at max);ComputeOpenDurationclamps non-positive / inverted bounds at use.half_open_admitted_(monotone per-cycle counter, not inflight — prevents slot-reuse after one probe completes).permitted_half_open_callsat cycle entry so a mid-cycle reload can't change the budget for the running cycle.REJECTED_OPEN_DRYRUN; caller proceeds).closed_gen_/halfopen_gen_) — stale reports drop silently; window-resize bump doesn't strand in-flight probes.TryAcquire()returnsAdmission { Decision, uint64_t generation };Report{Success,Failure,Neutral}takes the admission generation.ReportNeutral— slot-release path for admissions that terminate locally (POOL_EXHAUSTED, shutdown, client disconnect) without counting as success or failure.enabledtoggle; window-resize also resetsconsecutive_failures_.if (!config_.enabled) return ADMITTED;early return, zero atomic traffic when off.IsOpenDeadlineSet(),config(),NextOpenDurationMs().Host / Manager / RetryBudget
retry_budget.h/.cc—RetryBudgetclass. RAIIInFlightGuardfor per-attempt bookkeeping. CAS loopTryConsumeRetry(concurrent retries cannot race past the cap) with a non-retry denominator (cap =max(min_conc, (in_flight - retries_in_flight) * percent / 100)), so in steady state the effective retry fraction matches the configured percent rather than drifting above it.ComputeCap()observability accessor.circuit_breaker_host.h/.cc—CircuitBreakerHostowns N slices (one per dispatcher partition) + one sharedRetryBudget.Snapshot()aggregates per-slice counters + retry-budget state.Reload()fans out per-sliceSlice::Reloadcalls viaDispatcher::EnQueue.host_labelformat:service=<svc> host=<h>:<p> partition=<i>.circuit_breaker_manager.h/.cc—CircuitBreakerManagerkeyed by service name. Topology stable post-construction (lock-freeGetHost). Constructor validates dispatcher-count vs config partition-count mismatch (throws; skipped when dispatchers is empty for unit-test paths).Reload()serialized by mutex.Hot-path integration — ProxyTransaction + UpstreamManager + HttpServer
Ownership & wiring:
HttpServer::circuit_breaker_manager_— declared AFTERupstream_manager_so destruction runs breaker-first.UpstreamManager::AttachCircuitBreakerManager(raw*)— atomic non-owning pointer (release/acquire).HttpServer::MarkServerReadyinstalls a per-slice transition callback capturing(service, dispatcher_index). Fires only on CLOSED→OPEN. Wired for ALL upstreams regardless ofenabledso live reload fromenabled=false→trueworks without re-wiring.Result codes:
PoolPartition::CHECKOUT_CIRCUIT_OPEN = -6(delivered to wait-queue waiters drained on a breaker trip).ProxyTransaction::RESULT_CIRCUIT_OPEN = -7,RESULT_RETRY_BUDGET_EXHAUSTED = -8.ProxyTransactionhot-path changes:slice_+retry_budget_resolved once atStart().AttemptCheckoutcallsConsultBreaker()at the top; each attempt (first + retries) gets a fresh admission stamped with the slice's current generation.inflight_guard_(RAII) replaced on everyAttemptCheckout— stays at exactly one in_flight unit per transaction.MaybeRetrycallsTryConsumeRetrybefore committing to the retry; exhausted →DeliverResponse(MakeRetryBudgetResponse())(terminal, not reported).ReportBreakerOutcome(result_code)classifies per design §7 and fires BEFOREMaybeRetryat every failure site so the retry's freshConsultBreakersees the latest count.ReleaseBreakerAdmissionNeutral()inCancel()— client-disconnect always neutral (replacement probe slot acceptable; tripping a healthy backend on user-side abandonment would be a DOS vector).Response factories:
MakeCircuitOpenResponse()— state-aware Retry-After: OPEN reads storedslice->OpenUntil(); HALF_OPEN usesslice->NextOpenDurationMs()(exponential-backoff aware). Ceil division. Absolute cap 3600s.X-Circuit-Breaker: open|half_openlabel (distinguishes the two reject paths).MakeRetryBudgetResponse()— 503 +X-Retry-Budget-Exhausted: 1+Connection: close. NoRetry-After(budget has no recovery clock).Wait-queue drain on trip
PoolPartition::DrainWaitQueueOnTrip()— dispatcher-thread. Iterateswait_queue_, pops each entry, fireserror_callback(CHECKOUT_CIRCUIT_OPEN)on non-cancelled waiters. Skips ifshutting_down_(InitiateShutdown is already draining with CHECKOUT_SHUTTING_DOWN). Hoistsalive_against teardown re-entry. Does NOT setshutting_down_— this is a transient drain; the partition keeps its connections for HALF_OPEN probing.UpstreamManager::GetPoolPartition(service, index)accessor.Observability
All events surface through structured logs + a snapshot API. Full log catalog is in docs/circuit_breaker.md §Observability; highlights:
CLOSED → OPENtrip atwarn:trigger,consecutive_failures,window_total,window_fail_rate,open_for_ms,consecutive_trips(captured pre-reset so operators can distinguish a consecutive trip from a rate trip).OPEN → HALF_OPEN/HALF_OPEN → CLOSED/HALF_OPEN → OPENat appropriate levels.infofor the breadcrumb, subsequent atdebug. Dry-run rejects atinfowith[dry-run]prefix.retry budget exhaustedatwarn:service,in_flight,retries_in_flight,cap(via newRetryBudget::ComputeCap()accessor).circuit breaker config appliedatinfoon every reload.PoolPartition draining wait queue on breaker tripatinfowithqueue_size.CircuitBreakerManager::SnapshotAll()returns per-host rows with per-slice counters (state,trips,rejected,probe_successes,probe_failures,RejectedHalfOpenFull,ReportsStaleGeneration) and host aggregates (total_trips,total_rejected,open_partitions,half_open_partitions,retries_in_flight,retries_rejected,in_flight). A future/admin/breakersendpoint would JSON-serialize this.Hot-reload
HttpServer::Reloadinvokescircuit_breaker_manager_->Reload(new_config.upstreams)unconditionally — idempotent when no CB fields changed (atomic stores).Slice::Reloadis enqueued on the owning dispatcher so config mutations happen on the correct thread.enabledtoggle."upstream topology changes require a restart to take effect (circuit-breaker field edits, if any, were applied live)".upstream_configs_baseline persisted post-reload so subsequent reloads diff against the latest state.Development review history
The feature was built iteratively. Major review-caught regressions are captured as pitfall entries in development rules and regression tests. Highlights:
Core state-machine review rounds — pre-increment shift, Report* state guard, HALF_OPEN
saw_failureshort-circuit, Reload-across-enabled-toggle reset,saw_failurecounter misclassification, generation token for stale-report drop,OpenUntil()cleared in HALF_OPEN, window-resize generation bump, domain-split generation (closed_gen_/halfopen_gen_), orphanedconsecutive_failures_reset, probe budget snapshot at cycle entry,<< 0crash clamp, JSON strict type accessors,ComputeOpenDurationclamps,half_open_admitted_monotone counter,ReportNeutral, main.cc reload config save/restore.Hot-path integration review rounds:
RetryBudget::TryConsumeRetryraced concurrent retries past the capin_flight, letting retries inflate their own ceilingretries_in_flightRetry-Afterused truncating division (5500ms → 5s)CircuitBreakerManagerctor accepted dispatcher-count mismatch silentlyhost_labelformat driftedservice=X host=Y:Z partition=NOnErrorpaths missedReportBreakerOutcomebeforeMaybeRetryfor timeout / disconnectCancel()didn't release the admission — probe slot stranded on mid-probe abortReleaseBreakerAdmissionNeutral()inCancel()MakeCircuitOpenResponseproducedRetry-After: 0in HALF_OPENNextOpenDurationMs()Cancel()always neutralNextOpenDurationMs()OpenUntil()contract conflict across statesIsOpenDeadlineSet(),config(),NextOpenDurationMs()worker_threads=2flaky for sharding/echo/toggleroute 404 — backend only registered/failtrips == 0assertionTestHalfOpenRetryAfterScalesWithBackoffhad recovery cycles resettingconsecutive_trips_Each test flagged at R5 was re-verified by injecting the described regression — the test failed, confirming the guard works.
Tests (+105 total, 365 → 470)
Config (
test/config_test.h)Defaults, JSON parse, partial block, round-trip, equality (CB excluded from
UpstreamConfig::operator==), 13 validation cases, 3 type-strictness cases.Circuit-breaker test suites (
test/circuit_breaker_*_test.h)circuit_breaker_test.hcircuit_breaker_components_test.hRetryBudget+CircuitBreakerHost+CircuitBreakerManagercomponent unit tests.circuit_breaker_integration_test.hcircuit_breaker_retry_budget_test.hcircuit_breaker_wait_queue_drain_test.hcircuit_breaker_observability_test.hretries_rejected >= 1).circuit_breaker_reload_test.hBuild system
Makefile:CIRCUIT_BREAKER_SRCS= 5.ccfiles;CIRCUIT_BREAKER_HEADERS= 6.hfiles;TEST_HEADERSincludes all 6circuit_breaker*_test.hfiles plus the core unit suite../test_runner circuit_breaker(or-B) runs every circuit-breaker suite.Documentation
Test plan
make clean && make -j4produces a clean build../test_runnerpasses all 470 tests../test_runner circuit_breaker(or-B) runs the circuit-breaker suites in isolation.circuit_breaker.enabled=false(the default), there is no behavioral change to production traffic — hot path is a single branch read against a nullptr slice or the disabled fast path."circuit breaker config applied"without a restart warn. Topology edits emit"upstream topology changes require a restart to take effect (circuit-breaker field edits, if any, were applied live)"— the CB portion of such a reload is still applied live.