fix(runtime): require health-probe before declaring startup ready #541

Open
XingYu-Zhong wants to merge 2 commits into develop from fix/windows-runtime-health

@XingYu-Zhong

Collaborator

Summary

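  • Fixes runtime connection failures on Windows 11. The GUI previously treated the stdout KUN_READY marker alone as proof of a successful startup, but that marker only proves that the process has started and the TCP port is bound; it does not prove that the HTTP server can actually respond to requests. On Windows, antivirus scanning, delays in native module loading, and similar factors can create a "listening but unable to serve" window. During that window the subsequent 20-second health check times out, the process is killed, and a restart loop follows.
  • Three layers of protection:
    1. serve-entry: self-checks /health before emitting KUN_READY (5-second timeout; on failure it writes a warning to stderr and continues anyway).
    2. kun-process: waitForKunStartup no longer resolves on the stdout marker alone. When a health probe is available, at least one probe must pass before startup counts as successful (a passing health probe on its own can still settle, which preserves the fallback for a lost stdout marker).
    3. index: waitForKunHealth records probe errors via logWarn (deduplicated) and logs when it gives up, to make diagnosis easier.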
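
The settle rule in the second layer can be illustrated with a small state holder. This is only a sketch under assumptions: `createStartupGate`, `StartupEvent`, and `observe` are hypothetical names invented here, and the real waitForKunStartup presumably also handles timeouts and process exit, which are left out.

```typescript
// Hypothetical sketch of the "marker alone is not enough" rule.
// None of these names come from the PR's code.
type StartupEvent = "stdout-marker" | "health-ok";

interface StartupGate {
  /** Records an event and returns true once startup may be declared ready. */
  observe(event: StartupEvent): boolean;
}

function createStartupGate(healthProbeAvailable: boolean): StartupGate {
  let sawMarker = false;
  let sawHealthy = false;
  return {
    observe(event: StartupEvent): boolean {
      if (event === "stdout-marker") sawMarker = true;
      if (event === "health-ok") sawHealthy = true;
      // A passing probe always settles (covers a lost stdout marker).
      // The marker settles only when no probe is available at all.
      return sawHealthy || (!healthProbeAvailable && sawMarker);
    },
  };
}
```

With a probe available, `observe("stdout-marker")` keeps returning false until a `"health-ok"` event arrives; without one, the marker settles immediately, which matches the old behavior.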

Test plan

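  • GUI typecheck (tsconfig.web.json + tsconfig.node.json) passes
  • kun typecheck passes
  • All 1707 GUI tests pass
  • kun tests: the only failures are 2 tests that were already red on develop (skill-runtime + memory-store, unrelated to this PR)
  • Verified on a real Windows 11 machine: the first launch after installation no longer enters the "Unable to connect to the local runtime" loop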

🤖 Generated with Claude Code

XingYu-Zhong and others added 2 commits June 23, 2026 22:51
…tup ready

Root cause: the GUI resolved startup on the stdout KUN_READY marker
alone, without verifying the HTTP server could actually respond. On
Windows, the server could be listening but temporarily unable to serve
(antivirus scanning, event-loop stalls from native module loading),
causing a 20-second health timeout followed by a crash-restart loop.

Three-layer fix:

1. serve-entry: self-verify health before emitting KUN_READY — the kun
   process proves its own HTTP handler works by fetching /health from
   itself (5s timeout, proceeds with warning on failure).

2. kun-process: waitForKunStartup no longer settles on the stdout marker
   alone when a health timer is available. A successful health probe is
   now REQUIRED (health alone still settles as before, preserving the
   fallback for lost stdout markers). This prevents the GUI from
   declaring success before the server actually responds.

3. index: waitForKunHealth logs probe errors (deduplicated) via logWarn
   so health failures are visible in the GUI logger for diagnostics.
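
One way to picture the deduplicated logging in item 3 is a wrapper that suppresses consecutive identical probe errors. The commit message does not say how deduplication is keyed, so comparing against the last message is an assumption, and `createDedupedProbeLogger`, `probeFailed`, and `gaveUp` are hypothetical names; only `logWarn` appears in the source.

```typescript
// Hypothetical sketch: warn once per distinct consecutive probe error.
type WarnFn = (message: string) => void;

function createDedupedProbeLogger(logWarn: WarnFn) {
  let lastMessage: string | undefined;
  return {
    /** Call on every failed probe; repeats of the same error stay quiet. */
    probeFailed(error: unknown): void {
      const message = error instanceof Error ? error.message : String(error);
      if (message === lastMessage) return;
      lastMessage = message;
      logWarn(`health probe failed: ${message}`);
    },
    /** Call once when the health wait gives up. */
    gaveUp(timeoutMs: number): void {
      logWarn(
        `health probe gave up after ${timeoutMs}ms (last error: ${lastMessage ?? "none"})`,
      );
    },
  };
}
```

A probe loop that fails every few hundred milliseconds with the same connection error would then produce one warning plus the final give-up line, instead of flooding the GUI logger.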

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…544)

Companion to the health-probe gate in this PR. #544 shows Windows 11
stuck in a "Kun did not report ready within 45000ms" -> SIGTERM ->
respawn loop, with some children also killed at 3-35s (before the
deadline). Two root causes:

1. The 45s startup deadline is too short. kun emits KUN_READY (and only
   then can /health respond) after startKunServe() resolves, which blocks
   on sqlite open, thread-store backfill, per-thread usage carryover, and
   the 10s MCP fast-connect race -- all before the HTTP server listens. A
   slow Windows disk (antivirus) with a large history exceeds 45s.
   Fix: KUN_STARTUP_TIMEOUT_MS is now resolveKunStartupTimeoutMs(platform,
   env) -- 90s on Windows, 60s elsewhere, overridable via the
   KUN_STARTUP_TIMEOUT_MS env var (clamped 15s-10min). The ceiling is free
   on fast machines: the parallel /health probe settles the moment the
   server responds, and a real crash rejects immediately via the exit
   event instead of waiting out the timeout.

2. Restart storm. The settings/MCP-apply paths read isChildRunning() and
   stopAndWait() a child that is still mid-startup, throwing away the
   boot's progress so it can never finish. Fix: waitForKunStartupSettled()
   (awaits the in-flight launch, swallowing errors) is awaited at the top
   of restartManagedRuntimeForSettingsChange,
   restartManagedRuntimeForMcpConfigChange, and restartRuntimeOnce, so a
   slow-but-healthy boot is no longer interrupted. Deadlock-safe: the start
   promise is only set after those apply paths pass the settings-apply
   gate, so they never wait on a launch that waits on them.
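
The timeout resolution described in item 1 can be sketched as a pure function. The defaults (90s on Windows, 60s elsewhere) and the 15s to 10min clamp come from the commit message; everything else is an assumption: the function name `sketchResolveStartupTimeoutMs` is hypothetical, `"win32"` is the value Node.js reports in `process.platform` on Windows, and falling back to the default on an empty or non-numeric override is a guess at the behavior.

```typescript
// Hypothetical sketch of a platform/env startup-timeout resolver.
const MIN_STARTUP_TIMEOUT_MS = 15_000; // 15 s floor
const MAX_STARTUP_TIMEOUT_MS = 10 * 60_000; // 10 min ceiling

function sketchResolveStartupTimeoutMs(
  platform: string,
  env: Record<string, string | undefined>,
): number {
  const fallback = platform === "win32" ? 90_000 : 60_000;
  const raw = env["KUN_STARTUP_TIMEOUT_MS"];
  // Assumption: a missing, blank, or non-numeric override is ignored.
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) return fallback;
  // Clamp the override into the allowed range.
  return Math.min(
    MAX_STARTUP_TIMEOUT_MS,
    Math.max(MIN_STARTUP_TIMEOUT_MS, Math.floor(parsed)),
  );
}
```

For example, an override of `1000` would be raised to the 15s floor, and `9999999` would be capped at 10 minutes.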

Tests: resolveKunStartupTimeoutMs platform/env matrix and
waitForKunStartupSettled behavior. GUI typecheck + full suite + build green.
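
The "await the in-flight launch, swallowing errors" behavior from the restart-storm fix might look roughly like the following. It is a minimal sketch with hypothetical names (`inFlightStart`, `trackStartup`, `waitForStartupSettledSketch`); it does not model the settings-apply gate that makes the real code deadlock-safe.

```typescript
// Hypothetical sketch: let restart paths wait for a launch that is still booting.
let inFlightStart: Promise<void> | null = null;

/** Registers a launch so that restart paths can wait for it to settle. */
function trackStartup(start: Promise<void>): Promise<void> {
  inFlightStart = start;
  const clear = (): void => {
    // Only clear if no newer launch has replaced this one.
    if (inFlightStart === start) inFlightStart = null;
  };
  start.then(clear, clear);
  return start;
}

/** Resolves once any in-flight launch has finished, whether it succeeded or failed. */
async function waitForStartupSettledSketch(): Promise<void> {
  const pending = inFlightStart;
  if (pending === null) return;
  try {
    await pending;
  } catch {
    // Swallowed on purpose: the caller only needs the launch to be over.
  }
}
```

A restart path would await `waitForStartupSettledSketch()` first and only then decide whether to stop the child, so a slow but healthy boot is allowed to finish.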

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>