fix(opencode): broaden transport errno coverage and stop retrying permanent DNS failures#1467
fix(opencode): broaden transport errno coverage and stop retrying permanent DNS failures#1467Astro-Han wants to merge 1 commit into
Conversation
…manent DNS failures The stream failure classifier only recognized 4 errno codes (ECONNRESET/ECONNREFUSED/ETIMEDOUT/UND_ERR_SOCKET) and hardcoded every transport disconnect as retryable. Two gaps: common transient codes (DNS, undici timeouts, broken pipe) fell through to UnknownError, and a permanent failure like ENOTFOUND (an unresolved host — usually a misconfigured base URL) was retried up to the cap, stalling the turn instead of surfacing the cause. - stream-failure-classifier.ts: broaden TRANSPORT_CODES (EAI_AGAIN, EPIPE, ECONNABORTED, EHOSTUNREACH, ENETUNREACH, ENOTFOUND, UND_ERR_CONNECT_TIMEOUT, UND_ERR_HEADERS_TIMEOUT, UND_ERR_BODY_TIMEOUT) and add a tiny message fallback (socket hang up / premature close) for disconnects that arrive with no errno code. An error carrying an HTTP statusCode is short-circuited up front: an HTTP response was received, so it is an API error classified by status downstream and must not be reclassified as transport — neither by a transport-coded cause nor by a transport phrase in its message (fromError runs this classifier before the APICallError branch). "terminated" is intentionally NOT in the fallback — on its own it is undici's generic wrapper and is only transport when its cause carries a transport code. - Make retryability per-errno: TransportDisconnect.retryable is now a boolean. ENOTFOUND is non-retryable (permanent DNS / misconfigured base URL); EAI_AGAIN (transient DNS) and all others stay retryable. - message-v2.ts fromError: propagate transport.retryable to APIError.isRetryable instead of hardcoding true. - retry.ts classifyRetry: transport_disconnect now honors isRetryable (removed from RETRY_TRANSIENT_KINDS, given its own branch) so a non-retryable transport error takes the existing "not retryable -> stop" path. retrySignal.retryable also feeds the recorded transport failure, so run-observability's retry_safety reports provider_terminal_failure for ENOTFOUND via the existing #1118 branch. Tests: classifier (ENOTFOUND non-retryable incl. nested cause, EAI_AGAIN / UND_ERR_HEADERS_TIMEOUT retryable, bare "socket hang up" message fallback, HTTP error with a transport phrase or transport-coded cause stays unclassified); classifyRetry (transport_disconnect honors isRetryable both ways); fromError (ENOTFOUND -> non-retryable APIError, EAI_AGAIN -> retryable, APICallError with "socket hang up" message + transport cause stays invalid_request). Refs #1105 #1123 Claude-Session: https://claude.ai/code/session_015bW9JQSkuB156gkNQdxCzi
|
Warning Review limit reached
More reviews will be available in 26 minutes and 4 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (6)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
Broadens the stream transport-failure classifier and makes its retryability per-errno. Before,
classifyStreamFailurerecognized only 4 errno codes (ECONNRESET/ECONNREFUSED/ETIMEDOUT/UND_ERR_SOCKET) and hardcoded every transport disconnect as retryable. Two gaps:UnknownError.ENOTFOUND(an unresolved host — usually a misconfigured base URL) was retried up to the cap, stalling the turn instead of surfacing the cause.Changes:
stream-failure-classifier.ts: broadenTRANSPORT_CODES(EAI_AGAIN,EPIPE,ECONNABORTED,EHOSTUNREACH,ENETUNREACH,ENOTFOUND,UND_ERR_CONNECT_TIMEOUT,UND_ERR_HEADERS_TIMEOUT,UND_ERR_BODY_TIMEOUT) + a tiny message fallback (socket hang up/premature close) for disconnects with no errno code. An error carrying an HTTPstatusCodeis short-circuited up front — an HTTP response was received, so it is an API error classified by status downstream (theAPICallErrorbranch infromError, which runs after this) and must not be reclassified as transport by a transport-coded cause or a transport phrase in its message."terminated"is intentionally not in the fallback (on its own it is undici's generic wrapper; only transport when its cause carries a transport code).TransportDisconnect.retryableis now a boolean.ENOTFOUNDis non-retryable;EAI_AGAIN(transient DNS) and all others stay retryable.message-v2.tsfromError: propagatetransport.retryabletoAPIError.isRetryableinstead of hardcodingtrue.retry.tsclassifyRetry:transport_disconnectnow honorsisRetryable(removed fromRETRY_TRANSIENT_KINDS, given its own branch), so a non-retryable transport error takes the existing "not retryable → stop" path.retrySignal.retryablealso feeds the recorded transport failure, so run-observability'sretry_safetyreportsprovider_terminal_failureforENOTFOUNDvia the existing fix(opencode): stop recommending auto-retry for terminal provider failures #1118 branch.Why
Split out of the error-classification work (#1105 / #1123) as its own slice because it changes retry strategy, not just classification.
ENOTFOUNDfrom a misconfigured base URL should fail fast with the real cause, not retry-storm into a stall; and the narrow errno set was dropping recoverable transient disconnects into opaqueUnknownError.Related Issue
Refs #1105 (classification spine), #1123 (classify-then-render umbrella).
Human Review Status
Pending
Review Focus
statusCodeshort-circuit: it must not drop a real raw transport error (raw transport errors carrycode/cause.codeand no HTTP status; only HTTP responses havestatusCode).classifyRetrynow honorsisRetryablefortransport_disconnect— confirm existing transport errnos (all retryable) still retry; onlyENOTFOUNDstops."terminated"from the message fallback (kept consistent with the existing over-match guard).Risk Notes
Behavior change (intentional, tested):
ENOTFOUNDno longer auto-retries.classifyRetry's transient-kind contract changed fortransport_disconnect(now honorsisRetryable); one existing test that asserted "transport_disconnect retries even when isRetryable is false" was updated to the new per-errno contract. No data migration. No UI.How To Verify
Pinned:
ENOTFOUND→ non-retryable transport (incl. nested cause);EAI_AGAIN/UND_ERR_HEADERS_TIMEOUT→ retryable; bare"socket hang up"→ retryable via message fallback; HTTP error with a transport phrase or transport-coded cause stays classified by status;classifyRetryhonorsisRetryablefortransport_disconnectboth ways.Screenshots or Recordings
N/A — backend transport-classification change, no UI.
Checklist
bug,enhancement,task,documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.app,ui,platform,harness,ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.P0,P1,P2,P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.Pending,Approved by @<reviewer>, orNot required: <reason>(default isPending; "not required" is restricted to bot-authored low-risk PRs).https://claude.ai/code/session_015bW9JQSkuB156gkNQdxCzi