tools: add outbound retry policy, DLQ, and per-host rate limits (#7, #8, #9) #22

Merged
TGPSKI merged 3 commits into main from feat/outbound-tool-resilience on Jun 6, 2026
@TGPSKI TGPSKI commented Jun 6, 2026

Owner

Closes #7, closes #8, closes #9.

Summary

Implements three outbound tool resilience features from the v0.2 roadmap, plus an improvement to the release-prep skill.

#7 — Per-tool retry policy

  • ToolRetryConfig (max_attempts, base_delay, max_delay, honor_retry_after) wired into ToolDefinition.Retry
  • isTransient classifies 429/5xx/network errors as transient; other 4xx permanent
  • Exponential backoff with 10% jitter; Retry-After header honoured when set
  • Zero ToolRetryConfig = single attempt — fully backward-compatible; existing tools unchanged
  • retry: block added to schemas/skill-1.schema.yaml

#8 — Outbound DLQ and requeue

  • Exhausted/permanent tool failures enqueued to outbound-dlq when Executor.QueueMgr != nil
  • New leather dlq inspect / leather dlq requeue CLI subcommands
  • QueueItem.ToolName / ToolTarget fields (omitempty — existing JSONL files unaffected)
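
To illustrate why `omitempty` leaves existing JSONL files unaffected, here is a small sketch. The `QueueItem` below is a hypothetical reduction: only `ToolName` and `ToolTarget` are named in this PR, while `ID`, `Payload`, and every JSON key are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// QueueItem is a stand-in for the real struct; ID, Payload, and all JSON
// key names are assumed for illustration.
type QueueItem struct {
	ID         string `json:"id"`
	Payload    string `json:"payload"`
	ToolName   string `json:"tool_name,omitempty"`
	ToolTarget string `json:"tool_target,omitempty"`
}

// encode renders one JSONL line (without the trailing newline).
func encode(item QueueItem) string {
	b, err := json.Marshal(item)
	if err != nil {
		panic(err)
	}
	return string(b)
}

// decode parses one JSONL line; absent fields keep their zero values.
func decode(line string) (QueueItem, error) {
	var item QueueItem
	err := json.Unmarshal([]byte(line), &item)
	return item, err
}

func main() {
	// A line written before the new fields existed.
	old := `{"id":"a1","payload":"hello"}`
	item, err := decode(old)
	if err != nil {
		panic(err)
	}
	// Re-encoding does not add empty tool fields, so the line is unchanged.
	fmt.Println("round-trips unchanged:", encode(item) == old)

	// A DLQ entry carries the tool context.
	item.ToolName = "http_get"
	item.ToolTarget = "example.invalid"
	fmt.Println(encode(item))
}
```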

#9 — Per-host rate limits and metrics

  • internal/tool/ratelimit.go: stdlib token-bucket HostLimiter; nil-safe Wait
  • config.yaml tools.rate_limits: parsed into model.Config.ToolRateLimits
  • Four new /metrics series: the counters leather_tool_retry_total, leather_tool_backoff_total, and leather_tool_rate_limit_wait_total, plus leather_outbound_dlq_depth

release-prep skill

  • Added Step 4: strike shipped items in ROADMAP.md, update _Last reviewed_
  • Added Step 5: update Supported Versions table in SECURITY.md

Backwards compatibility

All existing examples and configs are unaffected. retry: is optional (omission = single attempt), tools.rate_limits: only activates when present, and new QueueItem fields are omitempty.

Test plan

  • make ci passes (check + test-race + lint + integration)
  • go test -race ./internal/tool/... — retry, DLQ, rate-limit, and metrics tests
  • go test -race ./internal/cli/... (TestRunDLQ* suite with t.TempDir() state dirs)
  • go build ./... clean
  • Manual smoke: run an HTTP tool against a mock 500 server; confirm retry fires, DLQ item appears, leather dlq inspect lists it, leather dlq requeue moves it
  • Set tools.rate_limits: {localhost: "1/s"} in config; confirm second rapid call is delayed; check /metrics for leather_tool_rate_limit_wait_total > 0

TGPSKI added 2 commits June 5, 2026 20:53
…, #9)

Implements three tightly related outbound tool resilience features:

**#7 — Per-tool retry policy**
- `model.ToolRetryConfig` (max_attempts, base_delay, max_delay,
  honor_retry_after) wired into `model.ToolDefinition.Retry`
- `isTransient` classifies 429/5xx/network as transient; other 4xx permanent
- Exponential backoff with 10% jitter; `Retry-After` header honoured when set
- Zero `ToolRetryConfig` = single attempt (backward-compatible; no retry)
- `retry:` block added to `schemas/skill-1.schema.yaml`

**#8 — Outbound DLQ and requeue**
- Exhausted/permanent tool failures enqueued to `outbound-dlq` when
  `Executor.QueueMgr != nil`; `QueueItem.ToolName`/`ToolTarget` carry context
- New `leather dlq inspect` / `leather dlq requeue` CLI subcommands
- `<item-id>` must be last (after all flags) due to Go flag.FlagSet behaviour
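
The positional-argument constraint follows from how Go's standard `flag` package parses: it stops at the first non-flag argument. The sketch below demonstrates that with a made-up `--state-dir` flag; the real flags of `leather dlq requeue` are not listed in this PR, so both the flag name and `parseRequeue` are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseRequeue parses a hypothetical requeue command line and returns the
// flag value plus whatever arguments were left unparsed.
func parseRequeue(args []string) (stateDir string, rest []string, err error) {
	fs := flag.NewFlagSet("requeue", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the demo output
	fs.StringVar(&stateDir, "state-dir", "", "hypothetical flag, for illustration only")
	if err = fs.Parse(args); err != nil {
		return "", nil, err
	}
	return stateDir, fs.Args(), nil
}

func main() {
	// Flags first, item ID last: the flag is parsed.
	dir, rest, _ := parseRequeue([]string{"--state-dir", "/tmp/state", "item-1"})
	fmt.Printf("state-dir=%q rest=%q\n", dir, rest)

	// Item ID first: parsing stops at "item-1", so the flag is never seen
	// and ends up among the positional arguments.
	dir, rest, _ = parseRequeue([]string{"item-1", "--state-dir", "/tmp/state"})
	fmt.Printf("state-dir=%q rest=%q\n", dir, rest)
}
```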

**#9 — Per-host rate limits and metrics**
- `internal/tool/ratelimit.go`: stdlib token-bucket `HostLimiter`; nil-safe
  `Wait`; rate spec format `"N/s"`, `"N/m"`, `"N/h"`
- `config.yaml tools.rate_limits:` parsed into `model.Config.ToolRateLimits`
- Package-level atomic counters (`retry_total`, `backoff_total`,
  `rate_limit_wait_total`) exposed via `tool.MetricSnapshot()`
- `/metrics` gains `leather_tool_retry_total`, `leather_tool_backoff_total`,
  `leather_tool_rate_limit_wait_total`, `leather_outbound_dlq_depth`

All new tests pass under `-race`; `make ci` green.

Add Step 4 (strike shipped items in ROADMAP.md, update _Last reviewed_)
and Step 5 (update Supported Versions table in SECURITY.md) to the
release-prep skill. Renumber downstream steps; add both files to git add
and the pre-handoff checklist.
@TGPSKI TGPSKI merged commit a6998ef into main Jun 6, 2026
3 checks passed
@TGPSKI TGPSKI deleted the feat/outbound-tool-resilience branch June 6, 2026 04:12

Development

Successfully merging this pull request may close these issues.

  • tools: add outbound rate limits and metrics
  • tools: add outbound DLQ and requeue
  • tools: add outbound retry policy