Skip to content

fix(boot): unblock cold-boot core start on dev hosts#1324

Merged
senamakel merged 6 commits intotinyhumansai:mainfrom
senamakel:feat/memory-v5
May 7, 2026
Merged

fix(boot): unblock cold-boot core start on dev hosts#1324
senamakel merged 6 commits intotinyhumansai:mainfrom
senamakel:feat/memory-v5

Conversation

@senamakel
Copy link
Copy Markdown
Member

@senamakel senamakel commented May 7, 2026

Summary

  • Replace redux-persist/lib/storage default import with an inline localStorage adapter (Vite CJS dep pre-bundling resolved it to the module namespace, so cold boot threw storage.getItem is not a function before rehydrate).
  • Add start_core_process to the Tauri capability ACL so the BootCheckGate's spawn invoke is no longer rejected with "Command not found" (it was already a registered handler — just unreferenced in allow-core-process.toml).
  • Purge a stray ~/Library/LaunchAgents/com.openhuman.core.plist left by a prior worktree's service install during Tauri startup so its KeepAlive=true daemon stops re-binding port 7788 on every SIGKILL.
  • Switch the boot liveness probe from openhuman.ping (unregistered) to core.ping (Tier-1 dispatcher in src/core/dispatch.rs).
  • Skip the Proactively kill stale old OpenHuman RPC processes before the Tauri app reuses their port #1130 stale-listener takeover when the embedded task is already alive on the port — fixes self-takeover loop on second start_core_process invocation (HMR re-mount).
  • Read core version from response.result.versionupdate_version is wrapped by RpcOutcome::single_log, so the JSON-RPC body is { result: VersionInfo, logs }. Earlier readers used result.version_info.version and then result.version; both fell through to '' and pinned every boot to "outdated local".

Problem

After #1316 added the BootCheckGate, cold boot on a dev macOS host failed in six different ways stacked on top of each other:

  1. Vite's optimized-deps bundle of redux-persist/lib/storage exposed only the module namespace, so coreMode's persist adapter had no getItem and the renderer crashed before paint with Uncaught TypeError: storage.getItem is not a function.
  2. The new start_core_process Tauri command was registered but missing from the capability ACL → Command not found.
  3. A stray launchd agent from a sibling worktree (KeepAlive=true) re-spawned the daemon after every SIGKILL, defeating the Proactively kill stale old OpenHuman RPC processes before the Tauri app reuses their port #1130 stale-listener takeover and producing signaled pid <X> but port 7788 remained bound after 5000ms.
  4. The boot probe polled openhuman.ping, which the core doesn't register — every retry returned -32601 Method not found, the 10s budget elapsed, and the gate showed "Local core did not respond in time" with a healthy core sitting underneath.
  5. After (4) was fixed, a second start_core_process (HMR re-mounting BootCheckGate) saw the embedded server's own port as bound and walked into the takeover path against itself, producing port 7788 rebounded to pid <X> after terminating pid <Y>; refusing to kill a different process.
  6. After (5), the gate kept showing "Local core needs a restart" because the version reader navigated to a path that didn't exist in the JSON-RPC response (RpcOutcome wraps the value in { result, logs }).

End result: the BootCheckGate was effectively unreachable on a freshly-checked-out dev tree.

Solution

Six narrow, root-cause fixes — one per failure mode, each with a code comment naming the symptom so the next person hitting it can grep their way back here:

  • app/src/store/index.ts — inline localStorage adapter for coreMode, matching the shape of userScopedStorage. Same async semantics as the redux-persist default; just doesn't depend on Vite's CJS interop.
  • app/src-tauri/permissions/allow-core-process.toml — add start_core_process.
  • app/src-tauri/src/lib.rssetup closure now runs a best-effort launchctl bootout + fs::remove_file on ~/Library/LaunchAgents/com.openhuman.core.plist before CoreProcessHandle::new. Best-effort: missing HOME, missing id, plist already absent — all degrade silently. macOS-only via cfg(target_os = "macos").
  • app/src/lib/bootCheck/index.ts — probe core.ping (Tier-1), read response.result.version. Doc comments reference the upstream contracts (src/core/dispatch.rs, src/openhuman/update/ops.rs, src/rpc/mod.rs).
  • app/src-tauri/src/core_process.rs — idempotent fast path at the top of ensure_running: if self.task is Some and unfinished and the port is open, return Ok without identifying or killing the listener. Takeover path stays intact for genuinely external leftover binaries.
  • Test fixtures in app/src/lib/bootCheck/index.test.ts updated to match the wrapped response shape and core.ping method name.

Verified end-to-end on a real cold boot under pnpm dev:app: start_core_process invokes once → embedded core binds 7788 → core.ping succeeds → update_version returns { result: { version: '0.53.18', ... } } → gate transitions to match and renders children. restart_core_process and HMR re-mount paths also exercise the new self-takeover guard cleanly.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per docs/TESTING-STRATEGY.md
  • N/A: behaviour-only fix wiring across already-tested layers — boot-check unit tests updated in lockstep but no new diff coverage to chase here; Rust core-process change is a guard around an existing path
  • N/A: behaviour-only change — capability catalog (src/openhuman/about_app/) doesn't track ACL/persist-adapter wiring
  • N/A: no new feature IDs touched
  • No new external network dependencies introduced (mock backend used per docs/TESTING-STRATEGY.md)
  • N/A: cold-boot fix below the release-smoke surface
  • N/A: no linked issue — discovered while running pnpm dev:app on the boot-gated branch

Impact

  • Runtime: desktop dev hosts on macOS were the failure surface; CI Linux + production builds didn't hit it because (1) ships an existing optimized bundle, and (3) is a per-developer launchd artifact. Fix is no-op on hosts that didn't have the stray plist.
  • Migration: none — the inline storage adapter writes/reads the same localStorage keys redux-persist's default did.
  • Security: launchd-plist purge is bounded to the com.openhuman.core label and lives in the user's ~/Library/LaunchAgents. No privileged paths touched.
  • Performance: removed an HTTP round-trip per boot tick (openhuman.ping was failing instantly with -32601, now core.ping succeeds on the first try).

Related

  • Closes:
  • Follow-up PR(s)/TODOs: the [imessage] core RPC token is not initialized warning during boot is a benign startup race in the imessage scanner — worth a small wait_for_token gate but out of scope here. The DaemonHealth - setting status to disconnected warning under pnpm dev:app likely needs the legacy daemon-mode health check disabled now that the core is always embedded.

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: `feat/memory-v5`
  • Commit SHA: `8b0a8012`

Validation Run

  • `pnpm --filter openhuman-app format:check`
  • `pnpm typecheck`
  • Focused tests: `pnpm exec vitest run src/lib/bootCheck/index.test.ts` (18/18)
  • Rust fmt/check (if changed): `cargo check --manifest-path app/src-tauri/Cargo.toml`
  • Tauri fmt/check (if changed): same as above

Validation Blocked

  • `command:` N/A
  • `error:` N/A
  • `impact:` N/A

Behavior Changes

  • Intended behavior change: BootCheckGate now resolves to the running app on cold boot instead of looping on "Local core needs a restart" / "did not respond".
  • User-visible effect: dev-host cold boot works without manually killing daemons or removing launchd plists.

Parity Contract

  • Legacy behavior preserved: stale-listener takeover (Proactively kill stale old OpenHuman RPC processes before the Tauri app reuses their port #1130) still fires for genuinely external leftover binaries; only same-process self-takeover is short-circuited. redux-persist storage shape is unchanged (same async `getItem/setItem/removeItem` contract, same key prefixes).
  • Guard/fallback/dispatch parity checks: `waitForCore` retry loop unchanged; `checkVersion` still returns `noVersionMethod` on `-32601` and `unreachable` on transport failures.

Summary by CodeRabbit

  • Bug Fixes

    • Prevented self-takeover and duplicate startup of the local core process.
    • Fixed macOS startup by removing stale service entries that could block boot.
    • Ensured local core can be invoked when needed (startup permission added).
    • Improved resilience for persisted settings to avoid crashes on cold boot or storage errors.
  • Tests

    • Updated boot-check tests to cover new RPC behavior and additional failure scenarios.

senamakel added 5 commits May 6, 2026 23:20
Three regressions surfaced by tinyhumansai#1316's BootCheckGate flow on cold boot:

- redux-persist's `lib/storage` default import resolves to the module
  namespace under Vite's CJS dep pre-bundling, so `coreMode`'s storage
  adapter had no `getItem` and rehydrate threw on first paint. Replace
  with an inline localStorage-backed adapter matching `userScopedStorage`'s
  shape.
- `start_core_process` was registered as an invoke handler but missing
  from the Tauri capability ACL, so the gate's spawn invoke was rejected
  with "Command not found". Add it to allow-core-process.
- A stale `~/Library/LaunchAgents/com.openhuman.core.plist` left over
  from a prior worktree's `service install` (KeepAlive=true) re-spawned
  the daemon after every SIGKILL, defeating `ensure_running`'s
  stale-listener takeover and producing "signaled pid <X> but port 7788
  remained bound after 5000ms". Bootout + unlink the plist on macOS
  startup before `CoreProcessHandle::new`.
Two more cold-boot regressions surfaced once start_core_process was
reachable:

- BootCheckGate's `waitForCore` polled `openhuman.ping`, but the core
  registers ping under the Tier-1 dispatcher namespace as `core.ping`
  (see src/core/dispatch.rs). Every probe came back with -32601, the
  10s budget elapsed, and the gate showed "Local core did not respond
  in time" despite the core being healthy. Switch the probe — and the
  matching test fixtures — to `core.ping`.
- A second `start_core_process` invocation (e.g. HMR re-mounting the
  gate) saw the embedded server's own port as bound and walked into
  the tinyhumansai#1130 stale-listener takeover path against itself, producing
  "port 7788 rebounded to pid <X> after terminating pid <Y>; refusing
  to kill a different process". Add an idempotent fast path: if the
  in-process task is still alive and the port is open, the listener
  is us — return Ok without identifying or killing anything. The
  takeover path stays for genuinely external leftover binaries.
`openhuman.update_version` returns a flat VersionInfo
(`{ version, target_triple, asset_prefix }`) — see
src/openhuman/update/ops.rs::update_version. The boot-check was reading
`result.version_info.version`, a path that doesn't exist, so coreVersion
was always '' and every cold boot landed on `outdatedLocal` ("Local core
needs a restart") even when the running core matched the app build.

Read `result.version` directly and update the test fixtures to match.
`openhuman.update_version` is wrapped by RpcOutcome::single_log, so the
JSON-RPC response shape is `{ result: VersionInfo, logs }` (see
src/rpc/mod.rs::into_cli_compatible_json). The previous reader looked at
`result.version` and skipped the outer `result` wrapper, so coreVersion
was '' and the boot gate kept re-prompting "Local core needs a restart"
even after the user clicked through.

Read `response.result.version` and update the matching test fixtures.
@senamakel senamakel requested a review from a team May 7, 2026 06:40
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4f5183db-bf85-4816-951d-5cc2ec49d171

📥 Commits

Reviewing files that changed from the base of the PR and between 8b0a801 and 5655075.

📒 Files selected for processing (1)
  • app/src-tauri/src/lib.rs

📝 Walkthrough

Walkthrough

This PR enables Tauri start_core_process, adds an idempotent fast path in core startup, removes stale macOS LaunchAgent plists in dev, updates boot-check RPCs to core.ping and RpcOutcome result.version, strengthens boot-check tests, and guards redux-persist localStorage access for cold-boot safety.

Changes

Core process startup and boot-check integration

Layer / File(s) Summary
Permissions & Core Startup Infrastructure
app/src-tauri/permissions/allow-core-process.toml, app/src-tauri/src/core_process.rs, app/src-tauri/src/lib.rs
Permission allowlist adds start_core_process; CoreProcessHandle::ensure_running adds idempotent fast path when embedded task exists and RPC port is open; macOS setup (debug) removes stale com.openhuman.core.plist and attempts launchctl bootout for the per-user label.
Boot-check RPC Protocol Updates
app/src/lib/bootCheck/index.ts
Boot-check now probes with core.ping and reads openhuman.update_version as { result: { version } }, deriving coreVersion from response.result.version.
Boot-check Test Coverage
app/src/lib/bootCheck/index.test.ts
Unit tests updated for core.ping and new update_version response shape; adds/adjusts edge cases for no-version-method, service_status throwing, empty-version handling, repeated failures, and failed Tauri core startup; fake timers preserve test speed.
State Persistence for Cold Boot
app/src/store/index.ts
Introduces an inline localStorageAdapter that wraps localStorage calls in Promise.resolve() and try/catch, and updates coreModePersistConfig to use this adapter and persist only the mode field.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • tinyhumansai/openhuman#280: Also modifies app/src-tauri/permissions/allow-core-process.toml to add missing allowed commands and prevent invoke rejections.
  • tinyhumansai/openhuman#1316: Related local core startup and boot-check orchestration changes; shares start_core_process allowlist and boot-check behavior adjustments.
  • tinyhumansai/openhuman#1166: Alters ensure_running lifecycle behavior; related to takeover/probing logic that this PR refines with an idempotent fast path.

Poem

🐰 I hopped where plist files lay asleep,
I added permission, dug trenches deep.
No self-takeover, core.ping on the wire,
Tests sharpened up, persistence made me wiser.
The core wakes up — a gentle, jolly leap!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main objective of the PR: fixing cold-boot core startup issues on macOS dev hosts by removing stale launch agents and enabling the start_core_process command.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src-tauri/src/lib.rs`:
- Around line 1357-1403: The macOS cleanup always runs and unconditionally calls
launchctl bootout on STALE_LABEL and deletes
~/Library/LaunchAgents/{STALE_LABEL}.plist; restrict this to the
stale-dev-worktree recovery path by either gating the whole block behind a
debug/dev-only check (e.g., cfg!(debug_assertions) or an env var) or by
validating the plist points at a different executable/worktree before doing
bootout/remove: read the plist (or its Program/ProgramArguments) and compare
with std::env::current_exe(), only run the
Command::new("launchctl").arg("bootout").arg(&target) and
std::fs::remove_file(&plist) when the plist is stale; ensure you keep the
STALE_LABEL constant and the existing uid/bootstrap logic but skip
bootout/remove for a matching/active daemon so normal service install +
daemon_mode is not torn down.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e7803e4-490e-4f87-b0e8-2b6b4ba482f9

📥 Commits

Reviewing files that changed from the base of the PR and between a6c7b0f and 8b0a801.

📒 Files selected for processing (6)
  • app/src-tauri/permissions/allow-core-process.toml
  • app/src-tauri/src/core_process.rs
  • app/src-tauri/src/lib.rs
  • app/src/lib/bootCheck/index.test.ts
  • app/src/lib/bootCheck/index.ts
  • app/src/store/index.ts

Comment thread app/src-tauri/src/lib.rs
CodeRabbit (PR tinyhumansai#1324) flagged that the unconditional purge would also
tear down a legitimate `service install` on every restart. Tighten the
gate so the cleanup only fires when:

- `cfg!(debug_assertions)` — release builds never auto-purge.
- `!daemon_mode` — the process running as the daemon doesn't kill its
  own plist out from under itself.
- the plist's `ProgramArguments[0]` points at a binary other than
  `current_exe()` — so a `service install` against THIS build is
  preserved; only sibling-worktree leftovers get zapped.

The plist parse is intentionally minimal (first `<string>` after the
`ProgramArguments` key, no full XML parser) — the installer always
writes a single absolute binary path there (see
src/openhuman/service/macos.rs).
@senamakel senamakel merged commit 22946a1 into tinyhumansai:main May 7, 2026
19 of 20 checks passed
senamakel added a commit to senamakel/openhuman that referenced this pull request May 7, 2026
Coverage Gate failed on PR tinyhumansai#1324 because the inline `localStorageAdapter`
in `app/src/store/index.ts` was 13/18 missed lines (top-level module init
isn't exercised by any test, so the try/catch arms inside getItem/setItem/
removeItem all read as uncovered).

Extract the adapter to `app/src/store/localStorageAdapter.ts` and add
8 unit tests covering: success path, missing-key path, and the swallowed-
throw path for each of the three methods, plus a shape assertion that
each returns a Promise (so redux-persist's await machinery is satisfied
even when the underlying call throws).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant