[P1 #261] runtime utils: withTimeout + classifyRuntimeResult (covers codex/grok/telegram in one pass)#272
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f4f7ef629f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // #261 P1 redirect — handshake decoupled from prompt deadline. | ||
| // 45s default keeps wedged initialize / authenticate / session | ||
| // bounded without affecting long-running prompts. | ||
| handshakeTimeoutMs: GROK_HANDSHAKE_TIMEOUT_MS, |
There was a problem hiding this comment.
Respect shorter Grok prompt timeouts for handshakes
This affects Grok nodes that set GROK_ACP_TIMEOUT_MS or flags.grokAcpTimeoutMs below 45s: this explicit default handshakeTimeoutMs bypasses runGrokAcpTurn's min(45_000, timeoutMs) fallback, so a node that previously bounded initialize/auth/session by (for example) 10s now waits the full 45s unless it also discovers and sets the new handshake knob. Compute the handshake default from the resolved prompt timeout, or only pass this option when the new knob is explicitly configured.
Useful? React with 👍 / 👎.
| const res = await withTimeout( | ||
| (signal) => fetch(`${tg.apiBase}/getUpdates?offset=${tg.offset}&timeout=30`, { signal }), | ||
| TELEGRAM_GETUPDATES_TIMEOUT_MS, | ||
| "telegram-getUpdates", | ||
| ); |
There was a problem hiding this comment.
Keep the Telegram body read under the timeout
This only bounds the fetch() promise, which resolves when response headers arrive; if Telegram or an intermediate connection sends headers and then stalls before the JSON body completes, the timer is cleared and the following res.json() can still pin the polling loop indefinitely. Keep the body read inside the withTimeout callback (return the parsed JSON there) so the abort signal and 45s deadline cover the entire getUpdates request.
Useful? React with 👍 / 👎.
| } | ||
| } | ||
| if (timer) clearTimeout(timer); | ||
| }, CLAUDE_TIMEOUT_MS, `claude-attempt-${attempt + 1}/${CLAUDE_MAX_RETRIES + 1}`); |
There was a problem hiding this comment.
Validate Claude timeout before passing it through
If CLAUDE_TIMEOUT_MS, --claude-timeout-ms, or the config value is a non-numeric string, parseInt produces NaN; the old CLAUDE_TIMEOUT_MS > 0 guard simply disabled the timer in that case, but withTimeout only treats <= 0 as disabled and passes NaN to setTimeout, which fires immediately. A typo in the timeout knob will therefore make every Claude attempt time out instantly; resolve or validate this value before calling withTimeout.
Useful? React with 👍 / 👎.
| const GROK_HANDSHAKE_TIMEOUT_MS = resolveTimeoutMs({ | ||
| envValue: process.env.GROK_HANDSHAKE_TIMEOUT_MS, | ||
| flagValue: typeof fileConfig.flags?.grokHandshakeTimeoutMs === "number" ? fileConfig.flags.grokHandshakeTimeoutMs : undefined, | ||
| defaultMs: 45_000, | ||
| }).valueMs; |
There was a problem hiding this comment.
Reject zero for the Grok handshake timeout
When a Grok node sets GROK_HANDSHAKE_TIMEOUT_MS=0 or flags.grokHandshakeTimeoutMs: 0, this resolver preserves the zero value, but the Grok path later passes it to GrokAcpClient.request(), whose raw setTimeout(..., 0) makes initialize/auth/session time out on the next tick rather than disabling the deadline. Either reject/clamp zero for this knob like resolveGrokAcpTimeout does, or teach the Grok request path to treat zero as disabled.
Useful? React with 👍 / 👎.
| result = `执行出错: vendor 返回空响应 (in=${u.input_tokens || 0} out=${u.output_tokens || 0}) — 疑似 vendor 静默限流/配额. ${hint}`; | ||
| } else { | ||
| result = m.result; | ||
| result = await withTimeout(async (signal) => { |
There was a problem hiding this comment.
Avoid overlapping Claude retries after timeout
When CLAUDE_TIMEOUT_MS fires and CLAUDE_MAX_RETRIES is nonzero, withTimeout rejects as soon as it aborts the signal, so the retry loop can start the next query() after the backoff even if the previous SDK turn is still unwinding or running a tool. The old code only called ac.abort() and did not leave the for await until the SDK settled, so slow/abort-ignoring tools could not overlap; now the same task can be executed twice and duplicate side effects such as file edits or send_task calls. On timeout, either avoid retrying or wait for the aborted attempt to finish cleanup before starting another attempt.
Useful? React with 👍 / 👎.
…classifier
Replace the planned point-fixes (codex turn timeout / grok handshake decouple) with two shared utils that cover all runtime call sites at once.
util ① — src/util/timeout.ts (158 LOC)
- withTimeout(fn(signal) => Promise<T>, ms, label, opts?) — one signature
- TimeoutError carries label + ms for log-side surfacing
- externalSignal forwarding lets callers participate in caller-supplied cancellation
- ms<=0 sentinel preserves CLAUDE_TIMEOUT_MS=0 "disabled" behaviour
- resolveTimeoutMs({envValue,flagValue,defaultMs,minMs,maxMs}) — env>flag>default precedence with clamping, generalises resolveGrokAcpTimeout
util ② — src/runtime/classify-result.ts (157 LOC)
- classifyRuntimeResult({result,usage,totalCostUsd,errorMessage}) →
success | soft-fail-empty | soft-fail-quota | error
- Folds #267 isRateLimitOrQuotaError into one cross-runtime decision
surface; adds a strict empty-result check and a three-zero
silent-reject rule (in=0 AND out=0 AND cost=0).
- Detection is intentionally strict: output_tokens=0 ALONE is not a
flag-trigger when result text is non-empty, because codex's usage
field is not reliably reported across providers. Empty result OR
the full three-zero triple flag; nothing else does.
- formatClassificationError() — canonical "执行出错: <runtime> ..." shape
that the IM bridge / dashboard parse to flag failure rows
- Re-exports building blocks so direct callers (auth-error fast-fail in
cli.ts) don't bounce through the wider surface
5 consumers wired
- claude (cli.ts processWithClaude): for-await loop now under
withTimeout(CLAUDE_TIMEOUT_MS), success branch delegates to
classifyRuntimeResult so empty-result rebrand is gone
- codex (cli.ts processWithCodex): ZERO wall-clock pre-fix → now
withTimeout(CODEX_TIMEOUT_MS, "codex-turn"); SDK TurnOptions.signal
propagates the abort cleanly. TimeoutError + 429 paths short-circuit
the thread rebuild loop. Classifier flags silent-reject reply.
- grok handshake (runtime/grok-build-acp/runtime.ts): new
handshakeTimeoutMs option (default min(45s, timeoutMs)) for
initialize+authenticate+session/new|load — decoupled from the 300s
session/prompt deadline so a wedged handshake fails fast.
cli.ts only forwards an explicit env / flag value; absent both, it
passes undefined so the runtime default applies (back-compat for
callers that set tight GROK_ACP_TIMEOUT_MS).
- telegram getUpdates (cli.ts): wrapped in withTimeout(45s) with
AbortSignal propagated into fetch — TCP wedge can no longer pin the
poll loop forever
- think() queue: implicitly bounded — each runtime call inside the
queue is now under its own withTimeout, no separate wrap needed
Verification
- Unit tests: 29 (timeout: happy/timeout/sentinel/external-signal/
resolver precedence/clamping/null) + 26 (classify: error precedence
/ three-zero strict / empty-result rule / codex false-positive
guard / vendor hint routing / formatter shape) = 55 new
- Full agent-node suite: 274 → 329 pass, 0 fail
- Functional smoke: 5 cases (timeout fires in 102ms / three-zero
detected / 429 routes to vendor dashboard / format msg starts with
expected prefix / real success classifies correctly) — all PASS
- bun build clean (0.39 MB cli.js, 106 modules)
Why one PR not split (per redirect dispatch)
Pure helpers + tests + glue together = single review pass. Splitting
would force the reviewer to re-context twice for what is logically
one cross-cutting change. #267 followed the same pattern.
f4f7ef6 to
712f210
Compare
* release(v0.11-preview1): bump 3 packages + release notes + PINNED audit Versions ======== - @sleep2agi/agent-network 2.2.22-preview.4 → 2.3.0-preview.0 - @sleep2agi/agent-node 2.4.15-preview.2 → 2.5.0-preview.0 - @sleep2agi/commhub-server 0.8.8 → 0.9.0-preview.0 PINNED_SERVER_VERSION (agent-network/bin/cli.ts) bumped to "0.9.0-preview.0" so `anet hub start` lazy-fetches the matching hub binary. Without this pin update, hub start silently hangs (#194 class) because npx resolves to a published version that no longer matches what the CLI expects. Release notes ============= docs/tests/release-v2.3.0-preview.0.md — contains the required ## Install (new user) and ## Upgrade (existing user) sections for the release-gate Gate 3 check. Lists every change in this preview: - P0-1 feishu worker supervised re-fork (#263) - P0-2 hub default credentials randomised + must_change_password (#264) - Runtime utils — withTimeout + classifyRuntimeResult (#272) - 429/quota fast-fail + empty-result soft-fail (folded into #272) - Cross-tenant write blocker (#275) - SSE memory-leak fix - B1 telegram allowFrom fail-closed (#276 — lands in preview1 batch) - B2 .anet/ auto-gitignore (#278 — lands in preview1 batch) - Slug guard + 6 P0 cleanups (#274) - Release-gate workflow (#270) - 5 onboarding robustness fixes - Feishu quickstart docs Migration callout: telegram empty/missing allowFrom now fail-closed (was: allow-all). Recovery is `"allowFrom": ["*"]` in access.json. Boot-time warn surfaces the new posture on first message. Verification (pre-publish) ========================== - Docker clean install: node:22-bookworm-slim + bun, 3 tarballs from absolute paths, `anet --version` → 2.3.0-preview.0; component resolution shows all 3 versions; `commhub-server` boots and serves /health at the new version - Docker post-publish: `anet hub start` lazy-fetches the published commhub-server@0.9.0-preview.0 and serves /health with version 0.9.0-preview.0; admin token saved at mode 600 with random bootstrap password (P0-2 verified live) - PINNED audit: source / Docker / npm all agree on 0.9.0-preview.0 - npm publish --tag preview from absolute tarball paths (no github short-link resolution risk) dist-tags after publish ======================= @sleep2agi/agent-network { latest: 2.2.21, preview: 2.3.0-preview.0 } @sleep2agi/agent-node { latest: 2.4.13, preview: 2.5.0-preview.0 } @sleep2agi/commhub-server { latest: 0.8.8, preview: 0.9.0-preview.0 } @latest is unchanged; promotion is a separate manual step after Vincent sign-off on the preview1 channel. * docs(release-v2.3.0-preview.0): inline tag literals + Install heading versions for release-gate
Author
Agent: 通信工程马
Refs: #261 (P1 redirect — supersedes planned P1-② codex-timeout + P1-③ grok-handshake point-fixes)
Why (per redirect dispatch — leverage > point-fixes)
Review round 4 systemic 提炼: don't ship 2-3 separate point-fixes for codex / grok / telegram timeouts and quota detection. Instead build 2 shared utils that every runtime call site consumes — covers all current failure shapes AND inoculates the codebase against the same bug class returning on the next runtime added.
#267 (already merged) is the seed: pure-helper + 38 unit tests + drop-in cli.ts glue. This PR generalises it to a cross-runtime decision surface AND adds the timeout primitive that was missing entirely on codex (zero wall-clock guard pre-fix).
Changes (6 files, +991 / -67)
util ① —
agent-node/src/util/timeout.ts(NEW, 158 LOC)withTimeout(fn(signal) => Promise<T>, ms, label, opts?)— single signature for all timeout-bounded async workTimeoutErrorcarrieslabel+timeoutMsso log-side surfaces don't have to re-stringifyexternalSignalforwarding lets callers participate in caller-supplied cancellationms <= 0sentinel preserves the pre-existingCLAUDE_TIMEOUT_MS=0 → disabledbehaviourresolveTimeoutMs({envValue, flagValue, defaultMs, minMs, maxMs})— env > flag > default precedence with clamping (generalisesresolveGrokAcpTimeout)util ② —
agent-node/src/runtime/classify-result.ts(NEW, 157 LOC)classifyRuntimeResult({result, usage, totalCostUsd, errorMessage}) → { kind: success | soft-fail-empty | soft-fail-quota | error, reason, hint }isRateLimitOrQuotaErrorinto one cross-runtime decision shared by claude / codex / groknull/undefined/""flags soft-fail-empty.output_tokens === 0alone with non-empty text is NOT a flag-trigger — codex's usage field is not reliably reported across providers, and the previous OR-rule false-positived on healthy codex turnsin=0 AND out=0 AND cost=0(all three required) → soft-fail-empty. Captures the M3 silent-429 episode without penalising codex's optional cost reportingformatClassificationError()— canonical执行出错: <runtime> ...shape the IM bridge / dashboard parse5 consumers wired
processWithClaudewithTimeout(CLAUDE_TIMEOUT_MS, "claude-attempt-N/M")with signal forwarded intooptions.abortControllerprocessWithCodexwithTimeout(CODEX_TIMEOUT_MS, "codex-turn")with SDKTurnOptions.signalforwarded;TimeoutError+ 429 fast-fail short-circuit thread rebuildsession/prompt— wedged handshake hid behind prompt deadlinehandshakeTimeoutMsoption (defaultmin(45s, timeoutMs)); prompt deadline untouched. Back-compat: cli.ts only forwards an explicit env / flag value; absent both, passesundefinedso the runtime default applies (preserves tightGROK_ACP_TIMEOUT_MSsemantics for old callers)withTimeout(45s)withAbortSignalpropagated intofetchwithTimeoutResult classifier wired in claude + codex success paths
m.result || "任务完成"→classifyRuntimeResult({result, usage, totalCostUsd})— empty-result rebrand is goneoutcome.finalResponse || "(无回复)"→ same classifier (no more silent reject masquerading as a real reply)Verification
Unit tests — 55 new pure-helper tests
Full agent-node test suite
bun test src/— 329 pass / 0 fail (was 274 pre-PR)Functional smoke — 5 end-to-end cases (all PASS)
Build
bun run build→ 0.39 MB cli.js, 106 modules, no warnings.Knobs added (opt-in; defaults match prior behaviour where possible)
CODEX_TIMEOUT_MScodexTimeoutMsGROK_HANDSHAKE_TIMEOUT_MSgrokHandshakeTimeoutMsmin(45s, timeoutMs)when unset (was: 300000 shared with prompt)TELEGRAM_GETUPDATES_TIMEOUT_MSCLAUDE_TIMEOUT_MS/GROK_ACP_TIMEOUT_MS/GROK_ACP_DRAIN_MSare unchanged.v2 changes (from reviewer pass)
undefinedso the runtime'smin(45s, timeoutMs)keeps protecting callers with a tightGROK_ACP_TIMEOUT_MS.🤖 Generated with Claude Code