tools: add outbound retry policy, DLQ, and per-host rate limits (#7, #8, #9) #22
Merged
…, #9)

Implements three tightly related outbound tool resilience features:

**#7 — Per-tool retry policy**
- `model.ToolRetryConfig` (max_attempts, base_delay, max_delay, honor_retry_after) wired into `model.ToolDefinition.Retry`
- `isTransient` classifies 429/5xx/network as transient; other 4xx permanent
- Exponential backoff with 10% jitter; `Retry-After` header honoured when set
- Zero `ToolRetryConfig` = single attempt (backward-compatible; no retry)
- `retry:` block added to `schemas/skill-1.schema.yaml`

**#8 — Outbound DLQ and requeue**
- Exhausted/permanent tool failures enqueued to `outbound-dlq` when `Executor.QueueMgr != nil`; `QueueItem.ToolName`/`ToolTarget` carry context
- New `leather dlq inspect` / `leather dlq requeue` CLI subcommands
- `<item-id>` must be last (after all flags) due to Go `flag.FlagSet` behaviour

**#9 — Per-host rate limits and metrics**
- `internal/tool/ratelimit.go`: stdlib token-bucket `HostLimiter`; nil-safe `Wait`; rate spec format `"N/s"`, `"N/m"`, `"N/h"`
- `config.yaml tools.rate_limits:` parsed into `model.Config.ToolRateLimits`
- Package-level atomic counters (`retry_total`, `backoff_total`, `rate_limit_wait_total`) exposed via `tool.MetricSnapshot()`
- `/metrics` gains `leather_tool_retry_total`, `leather_tool_backoff_total`, `leather_tool_rate_limit_wait_total`, `leather_outbound_dlq_depth`

All new tests pass under `-race`; `make ci` green.
Add Step 4 (strike shipped items in ROADMAP.md, update _Last reviewed_) and Step 5 (update Supported Versions table in SECURITY.md) to the release-prep skill. Renumber downstream steps; add both files to git add and the pre-handoff checklist.
Closes #7, closes #8, closes #9.
Summary
Implements three outbound tool resilience features from the v0.2 roadmap, plus an improvement to the release-prep skill.
#7 — Per-tool retry policy
- `ToolRetryConfig` (max_attempts, base_delay, max_delay, honor_retry_after) wired into `ToolDefinition.Retry`
- `isTransient` classifies 429/5xx/network errors as transient; other 4xx permanent
- Exponential backoff with 10% jitter; `Retry-After` header honoured when set
- Zero `ToolRetryConfig` = single attempt; fully backward-compatible, existing tools unchanged
- `retry:` block added to `schemas/skill-1.schema.yaml`
#8 — Outbound DLQ and requeue
- Exhausted/permanent tool failures enqueued to `outbound-dlq` when `Executor.QueueMgr != nil`
- New `leather dlq inspect` / `leather dlq requeue` CLI subcommands
- New `QueueItem.ToolName`/`ToolTarget` fields (`omitempty`; existing JSONL files unaffected)
#9 — Per-host rate limits and metrics
- `internal/tool/ratelimit.go`: stdlib token-bucket `HostLimiter`; nil-safe `Wait`
- `config.yaml tools.rate_limits:` parsed into `model.Config.ToolRateLimits`
- `/metrics` counters: `leather_tool_retry_total`, `leather_tool_backoff_total`, `leather_tool_rate_limit_wait_total`, `leather_outbound_dlq_depth`
release-prep skill
- Step 4: strike shipped items in `ROADMAP.md`, update _Last reviewed_
- Step 5: update Supported Versions table in `SECURITY.md`
Backwards compatibility
All existing examples and configs are unaffected.
`retry:` is optional (omission = single attempt), `tools.rate_limits:` only activates when present, and new `QueueItem` fields are `omitempty`.
Test plan
- `make ci` passes (check + test-race + lint + integration)
- `go test -race ./internal/tool/...`: retry, DLQ, rate-limit, and metrics tests
- `go test -race ./internal/cli/...`: `TestRunDLQ*` suite with `t.TempDir()` state dirs
- `go build ./...` clean
- With an item in `outbound-dlq`: `leather dlq inspect` lists it, `leather dlq requeue` moves it
- `tools.rate_limits: {localhost: "1/s"}` in config; confirm second rapid call is delayed; check `/metrics` for `leather_tool_rate_limit_wait_total > 0`