thrum socket path mismatch + hum doctor false positive on stale socket #40

@adiled

Summary

Three related issues. A socket path mismatch (Issue 1) caused all 30 forager bees to crash-loop, and gaps in hum doctor (Issue 2) and hum bee --list (Issue 3) meant neither command gave a clear signal.

Issue 1: thrum socket path inconsistency

humd plist sets HUM_THRUM_SOCK=/tmp/hum-501/hum/thrum.sock (the XDG_RUNTIME_DIR/hum/thrum.sock convention).

All bee plists (generated by hum bee enter / hive install) set HUM_THRUM_SOCK=/tmp/hum-501/thrum.sock (the old XDG_RUNTIME_DIR/thrum.sock path).

When humd crashes and restarts, it creates a new socket at its own path (…/hum/thrum.sock). Every bee still points at the old path → "connection refused" on every connect attempt → all bees crash-loop with exit code 1.

Observed: after humd died overnight, all 30 bees showed exit code 1 in launchctl list. None could connect until the humd plist was manually edited to match the bee plist path, then humd was restarted.

Workaround applied: changed the humd plist from /tmp/hum-501/hum/thrum.sock to /tmp/hum-501/thrum.sock to match the bee plists, then restarted humd. All bees connected within seconds.

Root fix needed: either (a) hum bee enter / hive install should write the same socket path that humd actually binds, or (b) when humd restarts it should bind at the path the bee plists already reference. The two plists should be kept in sync by the same codepath that generates them.
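
One way to keep the two plists in sync is to route both generators through a single path helper. The sketch below is illustrative only: `thrum_sock_path` and `plist_env_entry` are hypothetical names, not existing hum functions, and it assumes the XDG_RUNTIME_DIR/hum/thrum.sock convention is the one to keep.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: the single place that decides where the thrum
/// socket lives. Assumes the `<runtime dir>/hum/thrum.sock` convention.
pub fn thrum_sock_path(runtime_dir: &Path) -> PathBuf {
    runtime_dir.join("hum").join("thrum.sock")
}

/// Hypothetical helper: the HUM_THRUM_SOCK entry that both the humd plist
/// and every bee plist would embed, so the two cannot drift apart.
pub fn plist_env_entry(runtime_dir: &Path) -> String {
    format!(
        "<key>HUM_THRUM_SOCK</key>\n<string>{}</string>",
        thrum_sock_path(runtime_dir).display()
    )
}
```

If both the humd installer and hum bee enter / hive install called the same helper, a change of convention would land in every plist at once.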

Issue 2: hum doctor false positive on stale socket

Expected: hum doctor detects that the thrum socket is not accepting connections.

Actual: hum doctor reports thrum sock: /tmp/hum-501/thrum.sock ✓ present even when the socket file is stale (humd crashed, nothing is listening). The check only tests Path::exists(), not connectivity.

Evidence: during the crash-loop period, hum doctor output showed ✓ for the socket. Connecting to the socket via any client gave "connection refused (os error 61)". nc -z -U /tmp/hum-501/thrum.sock returned exit 1.

Fix: the doctor check should attempt a brief connect to the socket (e.g. write a byte, expect a breath back within 1s). If the connect fails or times out, report the socket as broken, not present.
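
A rough sketch of such a probe, using only the Rust standard library. The names `SockHealth` and `probe_sock` are hypothetical, and the one-byte ping/reply exchange is an assumption standing in for whatever handshake thrum actually uses.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::time::Duration;

/// Hypothetical result type for the doctor's socket check.
#[derive(Debug, PartialEq)]
pub enum SockHealth {
    /// No file at the path.
    Missing,
    /// File exists, but nothing accepted or answered; carries the reason.
    Stale(String),
    /// A listener accepted the connection and replied in time.
    Live,
}

/// Connect, send one byte, and wait for one byte back.
/// The one-byte exchange is an assumed placeholder for the real protocol.
fn ping(path: &Path, timeout: Duration) -> io::Result<()> {
    let mut stream = UnixStream::connect(path)?;
    stream.set_read_timeout(Some(timeout))?;
    stream.set_write_timeout(Some(timeout))?;
    stream.write_all(&[0u8])?;
    let mut reply = [0u8; 1];
    // Fails on timeout, and on EOF if the peer closes without replying.
    stream.read_exact(&mut reply)?;
    Ok(())
}

/// Distinguish "file is there" from "something is actually listening".
pub fn probe_sock(path: &Path, timeout: Duration) -> SockHealth {
    if !path.exists() {
        return SockHealth::Missing;
    }
    match ping(path, timeout) {
        Ok(()) => SockHealth::Live,
        Err(e) => SockHealth::Stale(format!("{}", e)),
    }
}
```

With a check like this, a socket file left behind by a crashed humd would come back as `Stale` with a "connection refused" reason instead of ✓ present.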

Issue 3: hum bee --list does not surface crash-loop state

hum bee --list reported all 30 bees as in nest (service running) throughout the crash-loop period. The actual launchctl state was exit code 1, null PID for most bees. hum doctor did surface the exit codes, but only in the detailed [bees] section which is easy to miss.

Suggestion: hum bee --list should show a warning indicator (e.g. ⚠ crash-looping (exit 1)) for any bee whose launchctl state shows a non-zero exit code and null PID.
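
The classification could look roughly like the sketch below. It assumes the bee state is read from `launchctl list` lines in the tab-separated PID / Status / Label form, with `-` for a missing PID. All type and function names are hypothetical, and the "not running" wording is a placeholder.

```rust
/// Hypothetical view of one `launchctl list` row.
#[derive(Debug, PartialEq)]
pub struct LaunchctlEntry {
    pub pid: Option<u32>,
    pub last_exit: i32,
    pub label: String,
}

/// Hypothetical status shown by `hum bee --list`.
#[derive(Debug, PartialEq)]
pub enum BeeStatus {
    Running,
    CrashLooping { exit_code: i32 },
    Stopped,
}

/// Parse one row of the form "PID<TAB>Status<TAB>Label"; "-" means no PID.
pub fn parse_launchctl_line(line: &str) -> Option<LaunchctlEntry> {
    let mut cols = line.split('\t');
    let pid_col = cols.next()?.trim();
    let status_col = cols.next()?.trim();
    let label = cols.next()?.trim().to_string();
    let pid = if pid_col == "-" { None } else { Some(pid_col.parse().ok()?) };
    let last_exit = status_col.parse().ok()?;
    Some(LaunchctlEntry { pid, last_exit, label })
}

/// Non-zero exit code with no PID is treated as crash-looping.
pub fn classify(entry: &LaunchctlEntry) -> BeeStatus {
    match (entry.pid, entry.last_exit) {
        (Some(_), _) => BeeStatus::Running,
        (None, 0) => BeeStatus::Stopped,
        (None, code) => BeeStatus::CrashLooping { exit_code: code },
    }
}

/// Render the status text for the list output.
pub fn render(status: &BeeStatus) -> String {
    match status {
        BeeStatus::Running => "in nest (service running)".to_string(),
        BeeStatus::CrashLooping { exit_code } => {
            format!("⚠ crash-looping (exit {})", exit_code)
        }
        // Placeholder wording, not taken from the current CLI.
        BeeStatus::Stopped => "not running".to_string(),
    }
}
```

The key design point is that "running" is derived from the presence of a PID rather than from the service merely being loaded.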

Reproduction

  1. Start all bees normally (they're running, all connected).
  2. Kill humd (pkill humd) and wait for launchctl to restart it.
  3. Observe: if humd plist and bee plists have different HUM_THRUM_SOCK paths, all bees crash-loop immediately.
  4. Run hum doctor — socket shows ✓ present.
  5. Run hum bee --list — bees show "in nest (service running)".

Environment

  • hum CLI 0.31.16
  • humd 0.31.16 (thrum 0.7.0)
  • macOS aarch64
  • 30 forager bees (daman swarm)
