fix(runtime): require health-probe before declaring startup ready #541
Open
XingYu-Zhong wants to merge 2 commits into
…tup ready

Root cause: the GUI resolved startup on the stdout KUN_READY marker alone, without verifying that the HTTP server could actually respond. On Windows, the server could be listening but temporarily unable to serve (antivirus scanning, event-loop stalls from native module loading), causing a 20-second health timeout followed by a crash-restart loop.

Three-layer fix:

1. serve-entry: self-verify health before emitting KUN_READY. The kun process proves its own HTTP handler works by fetching /health from itself (5s timeout, proceeds with a warning on failure).
2. kun-process: waitForKunStartup no longer settles on the stdout marker alone when a health timer is available. A successful health probe is now REQUIRED (health alone still settles as before, preserving the fallback for lost stdout markers). This prevents the GUI from declaring success before the server actually responds.
3. index: waitForKunHealth logs probe errors (deduplicated) via logWarn so health failures are visible in the GUI logger for diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
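The settle rule described in item 2 can be sketched as a small state reducer. This is a minimal illustration, not the PR's code: the names StartupEvent, StartupState, and reduceStartup are hypothetical, and the real waitForKunStartup is presumably promise- and timer-based rather than a pure function.

```typescript
// Hypothetical sketch of the startup gate; none of these names come from the PR.
type StartupEvent = "stdout-marker" | "health-ok" | "child-exit" | "deadline";

interface StartupState {
  sawMarker: boolean;
  outcome: "pending" | "ready" | "failed";
}

const initialStartupState: StartupState = { sawMarker: false, outcome: "pending" };

function reduceStartup(
  state: StartupState,
  event: StartupEvent,
  healthProbeAvailable: boolean,
): StartupState {
  // The first settlement wins; later events are ignored.
  if (state.outcome !== "pending") return state;

  if (event === "health-ok") {
    // A successful probe settles on its own, even if the stdout marker was lost.
    return { ...state, outcome: "ready" };
  }
  if (event === "stdout-marker") {
    // With a probe available, the marker is only recorded and startup stays pending.
    // Without one, the marker is the only readiness signal, so it settles.
    return healthProbeAvailable
      ? { ...state, sawMarker: true }
      : { sawMarker: true, outcome: "ready" };
  }
  // "child-exit" or "deadline": startup failed.
  return { ...state, outcome: "failed" };
}
```

Under this rule the marker followed by a probe success yields "ready", while the marker followed by the deadline yields "failed", which is the case the old marker-only logic reported as a success.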
…544)

Companion to the health-probe gate in this PR. #544 shows Windows 11 stuck in a "Kun did not report ready within 45000ms" -> SIGTERM -> respawn loop, with some children also killed at 3-35s (before the deadline). Two root causes:

1. The 45s startup deadline is too short. kun emits KUN_READY (and only then can /health respond) after startKunServe() resolves, which blocks on sqlite open, thread-store backfill, per-thread usage carryover, and the 10s MCP fast-connect race, all before the HTTP server listens. A slow Windows disk (antivirus) with a large history exceeds 45s.
   Fix: KUN_STARTUP_TIMEOUT_MS is now resolveKunStartupTimeoutMs(platform, env): 90s on Windows, 60s elsewhere, overridable via the KUN_STARTUP_TIMEOUT_MS env var (clamped 15s-10min). The ceiling is free on fast machines: the parallel /health probe settles the moment the server responds, and a real crash rejects immediately via the exit event instead of waiting out the timeout.
2. Restart storm. The settings/MCP-apply paths read isChildRunning() and stopAndWait() a child that is still mid-startup, throwing away the boot's progress so it can never finish.
   Fix: waitForKunStartupSettled() (awaits the in-flight launch, swallowing errors) is awaited at the top of restartManagedRuntimeForSettingsChange, restartManagedRuntimeForMcpConfigChange, and restartRuntimeOnce, so a slow-but-healthy boot is no longer interrupted. Deadlock-safe: the start promise is only set after those apply paths pass the settings-apply gate, so they never wait on a launch that waits on them.

Tests: resolveKunStartupTimeoutMs platform/env matrix and waitForKunStartupSettled behavior. GUI typecheck + full suite + build green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
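The timeout resolution in item 1 can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the function name resolveStartupTimeoutSketch is hypothetical, "win32" is the value Node.js reports in process.platform on Windows, and the handling of an empty or non-numeric override (falling back to the platform default) is an assumption the commit message does not confirm.

```typescript
// Hypothetical sketch; the PR's resolveKunStartupTimeoutMs may differ in detail.
const MIN_TIMEOUT_MS = 15_000; // 15s floor from the commit message
const MAX_TIMEOUT_MS = 10 * 60_000; // 10min ceiling from the commit message

function resolveStartupTimeoutSketch(
  platform: string,
  env: Record<string, string | undefined>,
): number {
  // 90s on Windows, 60s elsewhere.
  const platformDefault = platform === "win32" ? 90_000 : 60_000;

  const raw = env["KUN_STARTUP_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return platformDefault;

  const parsed = Number(raw);
  // Assumption: an unparsable override is ignored rather than treated as an error.
  if (!Number.isFinite(parsed)) return platformDefault;

  // Clamp the override into the 15s-10min range.
  return Math.min(MAX_TIMEOUT_MS, Math.max(MIN_TIMEOUT_MS, Math.round(parsed)));
}
```

A caller would pass process.platform and process.env. Keeping both as parameters is what makes a platform/env test matrix like the one mentioned under Tests straightforward to write.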
Summary
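The GUI treated the KUN_READY marker alone as a successful startup, but that marker only proves that the process has started and the TCP port is bound. It does not prove that the HTTP server can actually respond to requests. On Windows, antivirus scanning, delays in native module loading, and similar factors create a window in which the server is "listening but unable to serve", leading to a 20-second health-check timeout → process killed → restart loop.

- serve-entry: self-checks /health before emitting KUN_READY (5-second timeout; on failure it writes a warning to stderr and continues).
- kun-process: waitForKunStartup no longer resolves on the stdout marker alone. When a health probe is available, at least one probe must succeed before startup counts as successful (a passing health probe on its own can still settle, which preserves the fallback for a lost stdout marker).
- index: waitForKunHealth records probe errors via logWarn (deduplicated) and logs when it gives up, to make diagnosis easier.

Test plan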
tsconfig.web.json + tsconfig.node.json) passes

🤖 Generated with Claude Code