Skip to content

fix(opencode): classify provider API rejections as a terminal cause and stop overwriting their message#1468

Open
Astro-Han wants to merge 1 commit into
devfrom
claude/i1105-pr2-run-incident-provider-api
Open

fix(opencode): classify provider API rejections as a terminal cause and stop overwriting their message#1468
Astro-Han wants to merge 1 commit into
devfrom
claude/i1105-pr2-run-incident-provider-api

Conversation

@Astro-Han

@Astro-Han Astro-Han commented Jun 22, 2026

Copy link
Copy Markdown
Owner

Summary

Adds a provider_api_error terminal cause so a provider's API-level rejection is classified as what it is, and stops halt() from overwriting that rejection's real message with a generic connection-lost string. This is the run-incident + render half of the error-classification work; the classification half (402 → quota_exhausted) is PR #1466.

  • New terminal cause / classification. provider_api_error is added to run-incident's TerminalCause and to run-observability's Classification, with subcategory taken from providerFailure.kind (auth / rate_limit / quota_exhausted / server_overload / invalid_request / unknown). Both RUN_INCIDENT_SCHEMA_VERSION and RunObservability SCHEMA_VERSION bump to 2; the export pipeline preserves each existing package's own version (export.ts), so old bundles still read back as 1.
  • Parse once, route correctly. The processor parses the failure once and passes providerFailure (with HTTP evidence) into recordAttemptFailureAndDeriveRecovery. The recorder routes a real provider API rejection to provider_api_error instead of defaulting it to provider_transport_disconnect. The recorder does not parse a third time. classificationForIncident and retrySafetyFor gain provider_api_error branches: retryable === falsedo_not_auto_retry / provider_terminal_failure, aligned with the fix(opencode): stop recommending auto-retry for terminal provider failures #1118 retry_safety axis; a retryable one stays candidate_safe_auto_retry.
  • Recovery + message passthrough. recoveryFor stops a terminal provider API error with reason provider_api_error (kept out of the connection-lost reason set); a retryable provider API error (rate_limit / server_overload) flows through the existing auto-retry tree via retryableProviderFailure. The terminal halt() no longer overwrites a provider API rejection's real message with the connection-lost recovery string — while lifecycle-close and user-cancel halts keep their authoritative interruption messages.
  • Evidence gate. isProviderApiError gates the catch-all unknown kind on HTTP evidence (a status code or a response body). A wrapped connection failure — e.g. an APICallError with no HTTP response, or a DNS failure the stream classifier didn't recognize — stays on the transport path rather than being mislabeled a provider API error. Explicit kinds are self-evidencing.

Why

A Windows user's DeepSeek-direct account in arrears returned 402 {"error":{"message":"Insufficient Balance",...}}, but it surfaced as "Connection lost. Please check whether the last operation completed before resending." Three layers caused this: (1) 402 classified as unknown; (2) run-incident had no provider-API category, so it defaulted to provider_transport_disconnect; (3) halt() overwrote the real message with a recovery string. Layers 2 and 3 are fixed here; layer 1 is PR #1466. Combined, a billing failure now surfaces with its real provider message instead of "Connection lost".

Related Issue

Refs #1105 (classification spine), #1123 (classify-then-render umbrella).

Human Review Status

Pending

Review Focus

  • The routing discriminant (isProviderApiError): explicit kinds always route; unknown only routes with HTTP evidence so a wrapped connection failure isn't mislabeled. transport_disconnect and decompression stay on the transport path.
  • The terminal-only halt suppression: only the terminal recovery message is suppressed for provider API errors; the lifecycle-close / user-cancel halts (retry-race paths) keep their messages.
  • recoveryFor split: terminal provider errors stop; retryable ones (rate_limit / server_overload) must still auto-retry through retryableProviderFailure — confirm the existing retry tests still pass.
  • Schema bump blast radius: both versions go to 2; the decode path preserves each bundle's own version, and the only freshly-stamped assertions updated are in export.test.ts.

Risk Notes

Data-contract change (intentional, tested): two schema versions bump to 2 and two enums gain a provider_api_error value. Additive and backward-compatible — old bundles carry and keep their own schema_version: 1; the enums are not pinned by z.literal, so widening them does not reject old data. No SDK regeneration: these diagnostics enums are not exposed in sdk/js. No UI in this PR (per-kind cards and i18n are a follow-up). Behavior change: a terminal provider API error no longer auto-retries and shows its real message; a retryable provider API error keeps retrying exactly as before.

How To Verify

bun test test/session/run-observability.test.ts test/session/processor-effect.test.ts → 126 pass, 0 fail
  (incl. new: provider_api_error terminal/retryable classification + retry_safety; transport_disconnect/decompression stay transport;
   unknown-without-evidence stays transport, unknown-with-status routes; 402 surfaces real message, not "Connection lost")
bun test test/session src/session test/provider src/provider test/cli/export.test.ts → 1326 pass, 4 skip, 1 todo, 0 fail
tsgo --noEmit (packages/opencode) → clean
codex review (xhigh, 3 rounds) → round 1: 1 P2 (gate unknown on HTTP evidence) fixed; round 2: 1 P2 (preserve lifecycle interruption messages) fixed; round 3 fresh-eye: no P0/P1/P2/P3

Pinned: 402 (kind unknown, status 402) → provider_api_error, real message preserved; quota_exhausted retryable=false → do_not_auto_retry / provider_terminal_failure; rate_limit retryable=true still auto-retries; transport_disconnect / decompression / unknown-without-evidence → transport path; lifecycle-close halt keeps its message.

Screenshots or Recordings

N/A — backend classification + message-passthrough change, no UI.

Checklist

  • Type label — this PR carries exactly one of bug, enhancement, task, documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.
  • Routing labels — this PR carries at least one of app, ui, platform, harness, ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.
  • Priority label — this PR carries exactly one of P0, P1, P2, P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.
  • Human Review Status above is set to Pending, Approved by @<reviewer>, or Not required: <reason> (default is Pending; "not required" is restricted to bot-authored low-risk PRs).
  • I linked the related issue, or stated in Summary why there is no issue.
  • I described the review focus and any meaningful risks.
  • I replaced the example block in How To Verify with the real verification steps and the key result for each.
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope.
  • (conditional) I manually checked visible UI or copy changes when needed, with screenshots or recordings. Leave unticked only if no visible UI or copy changed.
  • (conditional) I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes. Leave unticked only if no platform/packaging surface was touched.
  • (conditional) I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant. Leave unticked only if none of those surfaces was touched.
  • I reviewed the final diff for unrelated changes and suspicious dependency changes.
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English.

https://claude.ai/code/session_015bW9JQSkuB156gkNQdxCzi

Summary by CodeRabbit

  • New Features

    • Added provider API error handling to distinguish API rejections from transport disconnects.
    • Provider-specific error messages (e.g., insufficient balance, quota exceeded) are now preserved and not overwritten by generic connection messages.
  • Improvements

    • Enhanced retry logic to treat provider API errors differently from transport failures.
    • Updated diagnostics schema to better track and categorize provider-specific failures.

…nd stop overwriting their message

A DeepSeek direct account in arrears returns 402 "Insufficient Balance",
but it surfaced as "Connection lost. Please check whether the last
operation completed before resending." Two of the three root-cause layers
live here (the third, 402 classification, is PR #1466):

1. run-incident had only watchdog/transport terminal causes, so every
   provider API rejection was recorded as provider_transport_disconnect.
2. halt() overwrote the real provider message with a generic connection-lost
   recovery string.

Changes:
- New provider_api_error TerminalCause (run-incident) and Classification
  (run-observability), subcategory from providerFailure.kind. Bumps both
  RUN_INCIDENT_SCHEMA_VERSION and RunObservability SCHEMA_VERSION to 2;
  export.test.ts schema/version assertions updated.
- The processor parses the failure once and passes providerFailure (with
  HTTP evidence) to the recorder, which routes a real provider API rejection
  to provider_api_error instead of defaulting it to a transport disconnect.
  classificationForIncident and retrySafetyFor gain provider_api_error
  branches (retryable=false -> provider_terminal_failure, aligned with #1118).
- recoveryFor: a terminal provider API error stops with reason
  provider_api_error (out of the connection-lost set); a retryable one
  (rate_limit / server_overload) flows through the existing auto-retry tree.
- The terminal halt no longer overwrites a provider API rejection's real
  message with the connection-lost recovery string. Lifecycle-close and
  user-cancel halts keep their authoritative interruption messages.
- isProviderApiError gates the catch-all "unknown" kind on HTTP evidence
  (status code or response body) so a wrapped connection failure is not
  mislabeled a provider API error.

Combined with PR #1466 (402 -> quota_exhausted), a billing failure now
surfaces with its real provider message instead of "Connection lost".

Refs #1105, #1123.

Claude-Session: https://claude.ai/code/session_015bW9JQSkuB156gkNQdxCzi
@Astro-Han Astro-Han added the bug Something isn't working label Jun 22, 2026
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Introduces a ProviderApiErrorKind canonical subset and isProviderApiError type guard to distinguish provider API rejections from transport disconnects. Threads structured providerFailure evidence through the session processor, observability recorder, run-incident types/derivation/policy/presentation, and schema versions (bumped to 2), with accompanying tests.

Changes

Provider API Error Classification and Retry Routing

Layer / File(s) Summary
ProviderApiErrorKind type guard and whitelist
packages/opencode/src/provider/error.ts
Defines ProviderApiErrorKind as Exclude<ProviderFailureKind, ...>, creates PROVIDER_API_ERROR_KINDS set, and exports isProviderApiError with special gating for the unknown kind requiring statusCode or hasResponseBody evidence.
Run-incident and run-observability schema types
packages/opencode/src/session/run-incident/types.ts, packages/opencode/src/session/run-observability/types.ts
Bumps both schema versions to 2, extends TerminalCause with a provider_api_error union variant, adds "provider_api_error" to RecoveryDecision.reason and Classification enum, and adds optional providerFailure to recordAttemptFailureAndDeriveRecovery input.
Run-incident derivation and namespace export
packages/opencode/src/session/run-incident/derive.ts, packages/opencode/src/session/run-incident/index.ts
Adds providerApiCause helper constructing a provider_api_error TerminalCause with confidence: "high", re-exported from the RunIncident namespace.
Observability recorder provider-API failure routing
packages/opencode/src/session/run-observability/recorder.ts
Adds "provider_api" Failure variant, introduces recordProviderApiFailureEvidence, routes providerFailure through isProviderApiError detection, and extends classify, summarySuffix, and retrySafetyFor to handle provider_api_error separately from transport.
Run-incident retry policy
packages/opencode/src/session/run-incident/policy.ts
Introduces retryableProviderFailure combining retryable provider_api_error with transport/watchdog causes, adds a non-retryable provider_api_error early guard returning do_not_retry, and replaces all retryableTransport gates with retryableProviderFailure.
Run-incident presentation
packages/opencode/src/session/run-incident/presentation.ts
Adds a dedicated plainSummary string for provider_api_error and classifies it as "error" severity.
Session processor wiring
packages/opencode/src/session/processor.ts
Reworks retrySignalFor to extract structured providerFailure from APIError, forwards it to recordAttemptFailureAndDeriveRecovery, and suppresses the generic interruption message when isProviderApiError confirms a provider API rejection.
Tests
packages/opencode/test/session/run-observability.test.ts, packages/opencode/test/session/processor-effect.test.ts, packages/opencode/test/session/export.test.ts
Adds unit tests for provider API vs transport routing (quota_exhausted, rate_limit, decompression, unknown with/without HTTP evidence), a live processor test asserting real error surfacing without "Connection lost" overwrite, and export test schema version bumps to 2.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • Astro-Han/pawwork#812: This PR extends the same RunIncident/recorder/export framework introduced in #812 by adding provider_api_error terminal-cause handling inside the existing derivation/classification/export pipeline.
  • Astro-Han/pawwork#914: Both PRs modify recoveryFor and the canAutoRetryBeforeFirstProviderProgress fast-path in run-incident/policy.ts, with this PR expanding retryability gating to cover provider API failures.
  • Astro-Han/pawwork#1008: Both PRs modify the run-incident retry surface—auto_retry/max-attempt behavior in policy.ts, types.ts, and presentation.ts—with this PR building new provider_api_error retry decisions on top of the same retry/recovery recommendation logic.

Suggested labels

P2, harness

🐇 A rabbit hopped through the error maze,
Found "quota_exhausted" lost in transport's haze.
"You're an API error, not a broken stream!"
Now kinds get classified with confidence supreme.
No more "Connection lost" to mask the truth—
The provider speaks clearly, and that's the proof! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: classifying provider API rejections as a terminal cause and preventing message overwriting.
Description check ✅ Passed The description is comprehensive and well-structured, covering all required template sections including summary, why, related issues, human review status, review focus, risk notes, verification steps, and completed checklist items.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/i1105-pr2-run-incident-provider-api

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install failed. For unrecoverable errors, disable the tool in CodeRabbit configuration.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added harness Model harness, prompts, tool descriptions, and session mechanics P2 Medium priority labels Jun 22, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/opencode/src/session/processor.ts (1)

1732-1760: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Single-parse design with structured evidence forwarding.

The rework correctly parses the error once and extracts providerFailure with all necessary fields (kind, code, statusCode, hasResponseBody) for downstream routing. The hasResponseBody check is appropriately defensive (empty string or non-string types yield false).

♻️ Optional: Extract inline type for clarity

The inline type annotation at lines 1734-1736 could be extracted to a named type if this return shape is referenced elsewhere or grows more complex:

type RetrySignalProviderFailure = {
  kind: ProviderFailureKind
  code?: string
  statusCode?: number
  hasResponseBody?: boolean
}

However, since this is a local return type and not referenced elsewhere, the current inline form is acceptable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/opencode/src/session/processor.ts` around lines 1732 - 1760, The
current implementation with the inline type annotation for the return shape
(containing kind, code, statusCode, and hasResponseBody fields) is acceptable
and correctly handles the providerFailure extraction with defensive checks. The
refactoring to extract this inline type definition into a named type like
RetrySignalProviderFailure is purely optional for improved code clarity and
maintainability, and should only be pursued if this return type shape is
referenced elsewhere in the codebase or if the team prefers explicit named types
for consistency. No mandatory changes are required.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/opencode/src/session/processor.ts`:
- Around line 1732-1760: The current implementation with the inline type
annotation for the return shape (containing kind, code, statusCode, and
hasResponseBody fields) is acceptable and correctly handles the providerFailure
extraction with defensive checks. The refactoring to extract this inline type
definition into a named type like RetrySignalProviderFailure is purely optional
for improved code clarity and maintainability, and should only be pursued if
this return type shape is referenced elsewhere in the codebase or if the team
prefers explicit named types for consistency. No mandatory changes are required.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 012387f9-11d0-4371-ae8f-21eb52dc6c8c

📥 Commits

Reviewing files that changed from the base of the PR and between f433411 and a2daaa2.

📒 Files selected for processing (12)
  • packages/opencode/src/provider/error.ts
  • packages/opencode/src/session/processor.ts
  • packages/opencode/src/session/run-incident/derive.ts
  • packages/opencode/src/session/run-incident/index.ts
  • packages/opencode/src/session/run-incident/policy.ts
  • packages/opencode/src/session/run-incident/presentation.ts
  • packages/opencode/src/session/run-incident/types.ts
  • packages/opencode/src/session/run-observability/recorder.ts
  • packages/opencode/src/session/run-observability/types.ts
  • packages/opencode/test/session/export.test.ts
  • packages/opencode/test/session/processor-effect.test.ts
  • packages/opencode/test/session/run-observability.test.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working harness Model harness, prompts, tool descriptions, and session mechanics P2 Medium priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant