Fix OAuth per-provider stall overrides and failover config leak#519
Open
sirn wants to merge 3 commits into
Open
Fix OAuth per-provider stall overrides and failover config leak#519sirn wants to merge 3 commits into
sirn wants to merge 3 commits into
Conversation
When failover iterates from Provider A (with stall overrides) to Provider B (without overrides), addStallConfig skipped updateConfig when resolveStallConfig returned null. The StallInspector retained A's config, so B was monitored with A's thresholds. Extract DEFAULT_STALL_CONFIG as a module-level constant. Change wireStallDetection return type to non-null — the function already returned a valid object, the declared type just lagged. Make addStallConfig always call updateConfig, falling back to DEFAULT_STALL_CONFIG when resolveStallConfig returns null, so a previous provider's overrides cannot survive into the next target. Add getConfig() accessor to StallInspector for test observability. Add stall.test.ts verifying the inspector config is properly reset when provider overrides are nulled out or when empty overrides are passed after a previous provider's overrides were applied.
The addTimeoutSource, addStallConfig, and effectiveStallConfig blocks ran after the isPiAiRoute branch, so OAuth routes skipped them entirely. The StallInspector used the global config (or a previous provider's overrides) instead of the per-provider stallTtfbMs / stallMinBps settings configured on the OAuth provider. Move the wiring block before the isPiAiRoute branch so both OAuth and non-OAuth routes receive per-provider stall detection. Remove the Object.keys guard on addStallConfig — it must be called unconditionally so the StallInspector is reset on each failover iteration (relying on the addStallConfig fix in the previous commit). Pass effectiveStallConfig to dispatchOAuthRequest and log it for debugging. The TTFB timer and probe deadline are not yet wired (follow-up commit).
OAuth streaming routes had no TTFB timeout — if the provider was slow to start responding, the request would hang indefinitely. The global ttfbSeconds config and per-provider stallTtfbMs overrides were computed but never enforced on the OAuth path because probeOAuthStream- Start had no deadline mechanism and dispatchOAuthRequest had no abort controller for stall detection. Wire a per-request TTFB abort controller inside dispatchOAuthRequest that races the executeRequest + probeOAuthStreamStart phases against an absolute deadline. The abort controller is separate from the client signal so stall aborts are distinguished from disconnects. Intercept stall AbortErrors before wrapOAuthError can swallow them — the OAuth transformer rewrites AbortError to generic 'Upstream timeout', losing the stall message. Detect stallAbortController.signal and re-throw with .isStallError = true. Add deadline-aware probe loop in probeOAuthStreamStart that races each reader.read() against the remaining TTFB budget. Bookkeeping events do not reset the clock; TTFB is defined as time to first non-bookkeeping event. Each per-read timer is cleaned up in a finally block and reader.cancel() releases the upstream connection on timeout. Add stall-aware failover in the OAuth catch: stall errors trigger markProviderStallFailure (not generic failure) and continue to the next target when failover is enabled. Even on the last target, the stalled provider is properly cooled down. Cancel the returned stream when the deadline is already exceeded after executeRequest returns, preventing upstream connection leaks during failover.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR fixes the following issues with stall detection:
Per-provider stall configuration leak
Since
StallInspectoris shared between all providers in a request, when failover moves from Provider A (with per-provider overrides) to Provider B (without per-provider overrides), Provider A's overrides is leaked onto Provider B.Steps to reproduce
5s/5000BRoot cause
The dispatcher has a guard around
addStallConfig. When Provider B has no overrides, theStallInspectoris never reconfigured. Additionally, whenresolveStallConfigisnull(stall detection off),addStallConfigskippedupdateConfiginstead of resetting it, so global "null" default (i.e. no stall check) is never applied.Fix
Made sure
addStallConfigis always called and makeaddStallConfiguse the default stall settings whenresolveStallConfigreturnsnull.Per-provider stall configuration were not being applied to OAuth providers
OAuth providers have its own code path (
if (this.isPiAiRoute(...))), but it was being called before provider stall overrides are computed. This means that OAuth providers will always use global stall configuration regardless of the presence of the stall configuration overrides.Steps to reproduce
5s/5000B30s/50Bgpt-5.5)Fix
Moved the per-provider wiring before dispatching to pi-ai. Also deduplicated
resolveStallConfiglogic and added a debug log.OAuth providers have no TTFB detection code path
probeOAuthStreamStarthas no TTFB abort mechanism.Fix
Wired per-request TTFB abort controller inside
dispatchOAuthRequest.