Skip to content

Fix OAuth per-provider stall overrides and failover config leak#519

Open
sirn wants to merge 3 commits into
mcowger:mainfrom
sirn:fix-oauth-stall-detection
Open

Fix OAuth per-provider stall overrides and failover config leak#519
sirn wants to merge 3 commits into
mcowger:mainfrom
sirn:fix-oauth-stall-detection

Conversation

@sirn
Copy link
Copy Markdown
Contributor

@sirn sirn commented May 21, 2026

This PR fixes the following issues with stall detection:

Per-provider stall configuration leak

Since StallInspector is shared between all providers in a request, when failover moves from Provider A (with per-provider overrides) to Provider B (without per-provider overrides), Provider A's overrides is leaked onto Provider B.

Steps to reproduce

  1. Make sure no stall configuration in global config
  2. Configure Provider A (with a mock slow server) with a strict TTFB timeout/TTFB byte threshold, e.g. 5s/5000B
  3. Configure Provider B (with the same mock slow server) without TTFB timeout/TTFB byte threshold (use default)
  4. Configure model alias with Provider A and B
  5. Send a request to Provider A, fails, failover to Provider B
  6. Provider B also treated as stall despite having no stall config (should use global)

Root cause

The dispatcher has a guard around addStallConfig. When Provider B has no overrides, the StallInspector is never reconfigured. Additionally, when resolveStallConfig is null (stall detection off), addStallConfig skipped updateConfig instead of resetting it, so global "null" default (i.e. no stall check) is never applied.

Fix

Made sure addStallConfig is always called and make addStallConfig use the default stall settings when resolveStallConfig returns null.

Per-provider stall configuration were not being applied to OAuth providers

OAuth providers have its own code path (if (this.isPiAiRoute(...))), but it was being called before provider stall overrides are computed. This means that OAuth providers will always use global stall configuration regardless of the presence of the stall configuration overrides.

Steps to reproduce

  1. Configure global stall configuration with a strict TTFB timeout/TTFB byte threshold, e.g. 5s/5000B
  2. Configure OAuth provider with a relaxed TTFB timeout/TTFB byte threshold, e.g. 30s/50B
  3. Send a streaming request to heavy OAuth models (e.g. gpt-5.5)
  4. OAuth provider stalled despite having a very relaxed configuration

Fix

Moved the per-provider wiring before dispatching to pi-ai. Also deduplicated resolveStallConfig logic and added a debug log.

OAuth providers have no TTFB detection code path

probeOAuthStreamStart has no TTFB abort mechanism.

Fix

Wired per-request TTFB abort controller inside dispatchOAuthRequest.

sirn added 3 commits May 22, 2026 02:54
When failover iterates from Provider A (with stall overrides) to
Provider B (without overrides), addStallConfig skipped updateConfig
when resolveStallConfig returned null. The StallInspector retained
A's config, so B was monitored with A's thresholds.

Extract DEFAULT_STALL_CONFIG as a module-level constant. Change
wireStallDetection return type to non-null — the function already
returned a valid object, the declared type just lagged. Make
addStallConfig always call updateConfig, falling back to
DEFAULT_STALL_CONFIG when resolveStallConfig returns null, so a
previous provider's overrides cannot survive into the next target.

Add getConfig() accessor to StallInspector for test observability.
Add stall.test.ts verifying the inspector config is properly reset
when provider overrides are nulled out or when empty overrides
are passed after a previous provider's overrides were applied.
The addTimeoutSource, addStallConfig, and effectiveStallConfig blocks
ran after the isPiAiRoute branch, so OAuth routes skipped them
entirely. The StallInspector used the global config (or a previous
provider's overrides) instead of the per-provider stallTtfbMs /
stallMinBps settings configured on the OAuth provider.

Move the wiring block before the isPiAiRoute branch so both OAuth
and non-OAuth routes receive per-provider stall detection. Remove
the Object.keys guard on addStallConfig — it must be called
unconditionally so the StallInspector is reset on each failover
iteration (relying on the addStallConfig fix in the previous commit).
Pass effectiveStallConfig to dispatchOAuthRequest and log it for
debugging. The TTFB timer and probe deadline are not yet wired
(follow-up commit).
OAuth streaming routes had no TTFB timeout — if the provider was slow
to start responding, the request would hang indefinitely. The global
ttfbSeconds config and per-provider stallTtfbMs overrides were
computed but never enforced on the OAuth path because probeOAuthStream-
Start had no deadline mechanism and dispatchOAuthRequest had no
abort controller for stall detection.

Wire a per-request TTFB abort controller inside dispatchOAuthRequest
that races the executeRequest + probeOAuthStreamStart phases against
an absolute deadline. The abort controller is separate from the
client signal so stall aborts are distinguished from disconnects.

Intercept stall AbortErrors before wrapOAuthError can swallow them —
the OAuth transformer rewrites AbortError to generic 'Upstream
timeout', losing the stall message. Detect stallAbortController.signal
and re-throw with .isStallError = true.

Add deadline-aware probe loop in probeOAuthStreamStart that races each
reader.read() against the remaining TTFB budget. Bookkeeping events
do not reset the clock; TTFB is defined as time to first
non-bookkeeping event. Each per-read timer is cleaned up in a finally
block and reader.cancel() releases the upstream connection on timeout.

Add stall-aware failover in the OAuth catch: stall errors trigger
markProviderStallFailure (not generic failure) and continue to the
next target when failover is enabled. Even on the last target, the
stalled provider is properly cooled down.

Cancel the returned stream when the deadline is already exceeded
after executeRequest returns, preventing upstream connection leaks
during failover.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant