Runtime: transport-aware retry policy for model connection failures #54

@jafreck

Problem

  • Migration runs repeatedly fail on transient upstream transport errors (HTTP/2 GOAWAY, CAPIError 503), causing task churn and manual reruns.

Proposal

  • Add error classification for transient transport failures vs deterministic failures.
  • For transient errors, apply exponential backoff with jitter and a per-model/per-agent circuit breaker.
  • Keep deterministic failures on fast-fail paths.
  • Emit structured retry telemetry (error_class, attempt, backoff_ms, breaker_state).
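
A minimal Python sketch of the classification, backoff, and telemetry pieces, to make the proposal concrete. Everything here is an assumption for illustration: the names `classify`, `backoff_ms`, `retry_event`, and `TRANSIENT_MARKERS`, the substring-matching heuristic, and the default base and cap values are hypothetical and not taken from the runtime. Only the telemetry field names come from the proposal.

```python
import random
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    DETERMINISTIC = "deterministic"


# Hypothetical markers; the real error types in the runtime may differ,
# and substring matching is a deliberately crude stand-in.
TRANSIENT_MARKERS = ("GOAWAY", "503")


def classify(message: str) -> ErrorClass:
    """Classify an error message as transient or deterministic."""
    if any(marker in message for marker in TRANSIENT_MARKERS):
        return ErrorClass.TRANSIENT
    return ErrorClass.DETERMINISTIC


def backoff_ms(attempt: int, base_ms: int = 500, cap_ms: int = 30_000, rng=random) -> int:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap_ms, base_ms * (2 ** attempt))
    return int(rng.uniform(0, ceiling))


def retry_event(message: str, attempt: int, breaker_state: str = "closed") -> dict:
    """Build one telemetry record with the fields named in the proposal."""
    error_class = classify(message)
    # Deterministic failures stay on the fast-fail path: no delay, no retry.
    delay = backoff_ms(attempt) if error_class is ErrorClass.TRANSIENT else 0
    return {
        "error_class": error_class.value,
        "attempt": attempt,
        "backoff_ms": delay,
        "breaker_state": breaker_state,
    }
```

Full jitter is one of several jitter strategies; it is used here only because it is short to write. The cap keeps the delay bounded once the attempt count grows.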

Acceptance Criteria

  • Transient transport failures are retried with exponential backoff + jitter.
  • Deterministic/schema failures fail fast and are not retried indefinitely.
  • Circuit breaker opens after a configurable failure threshold and allows traffic again after a cool-down period.
  • Logs clearly show classification and retry decisions.
  • Add/update tests for retry classification and breaker transitions.
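
The breaker transitions in these criteria could be sketched roughly as follows. This is a hypothetical illustration, not the runtime's design: the `CircuitBreaker` class, its state names (`closed`, `open`, `half_open`), the default threshold and cool-down, and the injectable `clock` parameter are all assumptions. The injectable clock is there so transition tests do not need to sleep.

```python
import time


class CircuitBreaker:
    """Minimal closed -> open -> half_open breaker (hypothetical sketch)."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0, clock=time.monotonic):
        self.threshold = threshold    # consecutive failures before opening
        self.cooldown_s = cooldown_s  # seconds to stay open before probing
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half_open"
        return "open"

    def allow(self) -> bool:
        """Requests pass when closed or when probing after the cool-down."""
        return self.state != "open"

    def record_failure(self) -> None:
        if self.state == "half_open":
            # The probe failed: re-open and restart the cool-down.
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self) -> None:
        # Any success resets the breaker to closed.
        self.failures = 0
        self.opened_at = None


# Fake clock, so the transitions can be exercised without waiting.
now = [0.0]
breaker = CircuitBreaker(threshold=2, cooldown_s=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
assert breaker.state == "open"
now[0] = 10.0
assert breaker.state == "half_open"
```

For the per-model/per-agent scoping mentioned in the proposal, one option would be a dictionary of breakers keyed by a `(model, agent)` tuple, so that one failing upstream does not block the others.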

Context reference

  • Observed repeated GOAWAY/503 agent failures during zstd migration runs.
