Skip to content

fix(opencode): broaden transport errno coverage and stop retrying permanent DNS failures#1467

Open
Astro-Han wants to merge 1 commit into
devfrom
claude/i1105b-transport-errno
Open

fix(opencode): broaden transport errno coverage and stop retrying permanent DNS failures#1467
Astro-Han wants to merge 1 commit into
devfrom
claude/i1105b-transport-errno

Conversation

@Astro-Han

@Astro-Han Astro-Han commented Jun 22, 2026

Copy link
Copy Markdown
Owner

Summary

Broadens the stream transport-failure classifier and makes its retryability per-errno. Before, classifyStreamFailure recognized only 4 errno codes (ECONNRESET/ECONNREFUSED/ETIMEDOUT/UND_ERR_SOCKET) and hardcoded every transport disconnect as retryable. Two gaps:

  • Common transient codes (DNS, undici timeouts, broken pipe) fell through to UnknownError.
  • A permanent failure like ENOTFOUND (an unresolved host — usually a misconfigured base URL) was retried up to the cap, stalling the turn instead of surfacing the cause.

Changes:

  • stream-failure-classifier.ts: broaden TRANSPORT_CODES (EAI_AGAIN, EPIPE, ECONNABORTED, EHOSTUNREACH, ENETUNREACH, ENOTFOUND, UND_ERR_CONNECT_TIMEOUT, UND_ERR_HEADERS_TIMEOUT, UND_ERR_BODY_TIMEOUT) + a tiny message fallback (socket hang up / premature close) for disconnects with no errno code. An error carrying an HTTP statusCode is short-circuited up front — an HTTP response was received, so it is an API error classified by status downstream (the APICallError branch in fromError, which runs after this) and must not be reclassified as transport by a transport-coded cause or a transport phrase in its message. "terminated" is intentionally not in the fallback (on its own it is undici's generic wrapper; only transport when its cause carries a transport code).
  • Per-errno retryability: TransportDisconnect.retryable is now a boolean. ENOTFOUND is non-retryable; EAI_AGAIN (transient DNS) and all others stay retryable.
  • message-v2.ts fromError: propagate transport.retryable to APIError.isRetryable instead of hardcoding true.
  • retry.ts classifyRetry: transport_disconnect now honors isRetryable (removed from RETRY_TRANSIENT_KINDS, given its own branch), so a non-retryable transport error takes the existing "not retryable → stop" path. retrySignal.retryable also feeds the recorded transport failure, so run-observability's retry_safety reports provider_terminal_failure for ENOTFOUND via the existing fix(opencode): stop recommending auto-retry for terminal provider failures #1118 branch.

Why

Split out of the error-classification work (#1105 / #1123) as its own slice because it changes retry strategy, not just classification. ENOTFOUND from a misconfigured base URL should fail fast with the real cause, not retry-storm into a stall; and the narrow errno set was dropping recoverable transient disconnects into opaque UnknownError.

Related Issue

Refs #1105 (classification spine), #1123 (classify-then-render umbrella).

Human Review Status

Pending

Review Focus

  • The statusCode short-circuit: it must not drop a real raw transport error (raw transport errors carry code/cause.code and no HTTP status; only HTTP responses have statusCode).
  • classifyRetry now honors isRetryable for transport_disconnect — confirm existing transport errnos (all retryable) still retry; only ENOTFOUND stops.
  • Dropping "terminated" from the message fallback (kept consistent with the existing over-match guard).

Risk Notes

Behavior change (intentional, tested): ENOTFOUND no longer auto-retries. classifyRetry's transient-kind contract changed for transport_disconnect (now honors isRetryable); one existing test that asserted "transport_disconnect retries even when isRetryable is false" was updated to the new per-errno contract. No data migration. No UI.

How To Verify

bun test src/session/stream-failure-classifier.test.ts src/session/retry.test.ts test/session/message-v2.test.ts → 111 pass, 0 fail
bun test test/session src/session test/provider src/provider → 1328 pass, 4 skip, 1 todo, 0 fail
tsgo --noEmit (packages/opencode) → clean
codex review (xhigh, 3 rounds) → final fresh-eye pass: no P0/P1/P2/P3

Pinned: ENOTFOUND → non-retryable transport (incl. nested cause); EAI_AGAIN / UND_ERR_HEADERS_TIMEOUT → retryable; bare "socket hang up" → retryable via message fallback; HTTP error with a transport phrase or transport-coded cause stays classified by status; classifyRetry honors isRetryable for transport_disconnect both ways.

Screenshots or Recordings

N/A — backend transport-classification change, no UI.

Checklist

  • Type label — this PR carries exactly one of bug, enhancement, task, documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.
  • Routing labels — this PR carries at least one of app, ui, platform, harness, ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.
  • Priority label — this PR carries exactly one of P0, P1, P2, P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.
  • Human Review Status above is set to Pending, Approved by @<reviewer>, or Not required: <reason> (default is Pending; "not required" is restricted to bot-authored low-risk PRs).
  • I linked the related issue, or stated in Summary why there is no issue.

https://claude.ai/code/session_015bW9JQSkuB156gkNQdxCzi

…manent DNS failures

The stream failure classifier only recognized 4 errno codes
(ECONNRESET/ECONNREFUSED/ETIMEDOUT/UND_ERR_SOCKET) and hardcoded every
transport disconnect as retryable. Two gaps: common transient codes (DNS,
undici timeouts, broken pipe) fell through to UnknownError, and a permanent
failure like ENOTFOUND (an unresolved host — usually a misconfigured base URL)
was retried up to the cap, stalling the turn instead of surfacing the cause.

- stream-failure-classifier.ts: broaden TRANSPORT_CODES (EAI_AGAIN, EPIPE,
  ECONNABORTED, EHOSTUNREACH, ENETUNREACH, ENOTFOUND, UND_ERR_CONNECT_TIMEOUT,
  UND_ERR_HEADERS_TIMEOUT, UND_ERR_BODY_TIMEOUT) and add a tiny message
  fallback (socket hang up / premature close) for disconnects that arrive with
  no errno code. An error carrying an HTTP statusCode is short-circuited up
  front: an HTTP response was received, so it is an API error classified by
  status downstream and must not be reclassified as transport — neither by a
  transport-coded cause nor by a transport phrase in its message (fromError
  runs this classifier before the APICallError branch). "terminated" is
  intentionally NOT in the fallback — on its own it is undici's generic wrapper
  and is only transport when its cause carries a transport code.
- Make retryability per-errno: TransportDisconnect.retryable is now a boolean.
  ENOTFOUND is non-retryable (permanent DNS / misconfigured base URL); EAI_AGAIN
  (transient DNS) and all others stay retryable.
- message-v2.ts fromError: propagate transport.retryable to APIError.isRetryable
  instead of hardcoding true.
- retry.ts classifyRetry: transport_disconnect now honors isRetryable (removed
  from RETRY_TRANSIENT_KINDS, given its own branch) so a non-retryable transport
  error takes the existing "not retryable -> stop" path. retrySignal.retryable
  also feeds the recorded transport failure, so run-observability's retry_safety
  reports provider_terminal_failure for ENOTFOUND via the existing #1118 branch.

Tests: classifier (ENOTFOUND non-retryable incl. nested cause, EAI_AGAIN /
UND_ERR_HEADERS_TIMEOUT retryable, bare "socket hang up" message fallback, HTTP
error with a transport phrase or transport-coded cause stays unclassified);
classifyRetry (transport_disconnect honors isRetryable both ways); fromError
(ENOTFOUND -> non-retryable APIError, EAI_AGAIN -> retryable, APICallError with
"socket hang up" message + transport cause stays invalid_request).

Refs #1105 #1123

Claude-Session: https://claude.ai/code/session_015bW9JQSkuB156gkNQdxCzi
@Astro-Han Astro-Han added the bug Something isn't working label Jun 22, 2026
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

@Astro-Han, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 26 minutes and 4 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d71ce1b7-e42b-4a2d-9f72-59b37e9bb50f

📥 Commits

Reviewing files that changed from the base of the PR and between f433411 and 6e39ced.

📒 Files selected for processing (6)
  • packages/opencode/src/session/message-v2.ts
  • packages/opencode/src/session/retry.test.ts
  • packages/opencode/src/session/retry.ts
  • packages/opencode/src/session/stream-failure-classifier.test.ts
  • packages/opencode/src/session/stream-failure-classifier.ts
  • packages/opencode/test/session/message-v2.test.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/i1105b-transport-errno

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added harness Model harness, prompts, tool descriptions, and session mechanics P2 Medium priority labels Jun 22, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working harness Model harness, prompts, tool descriptions, and session mechanics P2 Medium priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant