Problem
Migration runs repeatedly fail with transient upstream transport failures (HTTP/2 GOAWAY, CAPIError 503), causing task churn and manual reruns.
Proposal
Add error classification that separates transient transport failures from deterministic failures.
For transient errors, apply exponential backoff with jitter and a per-model/per-agent circuit breaker.
Keep deterministic failures on fast-fail paths.
Emit structured retry telemetry (error_class, attempt, backoff_ms, breaker_state).
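A minimal sketch of the classification, backoff, and telemetry pieces of the proposal, in Python. It assumes errors can be classified by substring match on the failure message. The names `classify_error`, `backoff_ms`, `retry_event`, and `TRANSIENT_MARKERS`, along with the default base and cap values, are hypothetical and not taken from the existing codebase. The backoff uses the "full jitter" variant (uniform between 0 and the exponential ceiling), which is one of several reasonable choices.

```python
import random
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    DETERMINISTIC = "deterministic"


# Hypothetical markers drawn from the failures named in this issue.
# A real classifier would likely inspect typed errors or status codes instead.
TRANSIENT_MARKERS = ("GOAWAY", "503")


def classify_error(message: str) -> ErrorClass:
    """Classify a failure as transient if its message contains a known marker."""
    if any(marker in message for marker in TRANSIENT_MARKERS):
        return ErrorClass.TRANSIENT
    return ErrorClass.DETERMINISTIC


def backoff_ms(attempt: int, base_ms: int = 500, cap_ms: int = 30_000, rng=random) -> int:
    """Full-jitter exponential backoff.

    The ceiling doubles with each attempt (attempt 0 -> base_ms) and is capped
    at cap_ms; the returned delay is uniform between 0 and that ceiling.
    """
    ceiling = min(cap_ms, base_ms * (2 ** attempt))
    return int(rng.uniform(0, ceiling))


def retry_event(message: str, attempt: int, breaker_state: str) -> dict:
    """Build one structured telemetry record with the fields listed above.

    Deterministic failures get backoff_ms = 0, signalling the fast-fail path.
    """
    error_class = classify_error(message)
    delay = backoff_ms(attempt) if error_class is ErrorClass.TRANSIENT else 0
    return {
        "error_class": error_class.value,
        "attempt": attempt,
        "backoff_ms": delay,
        "breaker_state": breaker_state,
    }
```

A caller would sleep for `backoff_ms` before the next attempt when `error_class` is `transient`, and surface the error immediately otherwise. Substring matching on `"503"` is deliberately crude and could misfire on unrelated messages; it stands in for whatever typed signal the transport layer exposes.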
Acceptance Criteria
Transient transport failures are retried with exponential backoff + jitter.
Deterministic/schema failures are not endlessly retried.
Circuit breaker opens after configurable threshold and cools down.
Logs clearly show classification and retry decisions.
Add/update tests for retry classification and breaker transitions.
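The breaker criteria above could be exercised against something like the following sketch. It assumes a conventional three-state breaker (closed, open, half_open) with a consecutive-failure threshold and a fixed cooldown. The class name `CircuitBreaker`, its fields, the default values, and the `(model, agent)` keying are all hypothetical. The clock is injectable so that transition tests do not need to sleep.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CircuitBreaker:
    failure_threshold: int = 3          # consecutive failures before opening
    cooldown_s: float = 30.0            # time spent open before a probe is allowed
    clock: Callable[[], float] = time.monotonic
    failures: int = 0
    opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half_open"
        return "open"

    def allow(self) -> bool:
        """Requests pass while closed, and as a probe while half_open."""
        return self.state != "open"

    def record_failure(self) -> None:
        if self.state == "half_open":
            # The probe failed: reopen and restart the cooldown.
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def record_success(self) -> None:
        """Any success closes the breaker and clears the failure count."""
        self.failures = 0
        self.opened_at = None


# One breaker per (model, agent) pair, created on first use with defaults.
breakers = defaultdict(CircuitBreaker)
```

A transition test can drive the clock by hand, for example `CircuitBreaker(failure_threshold=2, cooldown_s=10.0, clock=lambda: now[0])` with `now` a one-element list, and assert on `state` after each `record_failure` or `record_success` call.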
Context reference
Observed repeated GOAWAY/503 agent failures during zstd migration runs.