fix: graceful shutdown on external (UPS/qm/ACPI) shutdown#3319

Merged
dr-bonez merged 2 commits into master from fix/graceful-external-shutdown on Jun 15, 2026

@helix-nine helix-nine commented Jun 15, 2026

Contributor

Summary

Makes externally-initiated shutdowns graceful, so StartOS service containers are torn down by startd before systemd proceeds — covering qm shutdown, ACPI, and UPS/NUT-triggered shutdown -h (the dependency for #3317). Closes #3235.

Previously, only a shutdown initiated from the UI or via start-cli server shutdown ran startd's graceful teardown. A system-initiated shutdown let systemd stop everything around startd, terminating services abruptly (a data-corruption risk for Lightning/Bitcoin workloads).

Approach (revised per review)

startd already handles SIGTERM by running the same graceful teardown as a UI shutdown. The gap is purely ordering: systemd ordering is symmetric, so because startd starts before the things it manages, it is stopped after them — the opposite of what we need. Two small pre-shutdown barrier units invert that:
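The ordering argument can be illustrated with a small model. This is only a sketch of the rule that systemd stops units in the reverse of their start order; `start_order` and `stop_order` are hypothetical helpers written for this example, not systemd or StartOS code, and the unit names are the ones used in this PR.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Start order for units given their `After=` edges (unit -> units it starts after).
/// Plain depth-first topological sort; assumes the graph has no cycles.
pub fn start_order(after: &BTreeMap<&'static str, Vec<&'static str>>) -> Vec<&'static str> {
    fn visit(
        unit: &'static str,
        after: &BTreeMap<&'static str, Vec<&'static str>>,
        seen: &mut BTreeSet<&'static str>,
        out: &mut Vec<&'static str>,
    ) {
        if !seen.insert(unit) {
            return; // already placed
        }
        if let Some(deps) = after.get(unit) {
            for dep in deps {
                visit(*dep, after, seen, out);
            }
        }
        out.push(unit);
    }

    let mut seen = BTreeSet::new();
    let mut out = Vec::new();
    for unit in after.keys() {
        visit(*unit, after, &mut seen, &mut out);
    }
    out
}

/// Stop order is modeled as the exact reverse of the start order.
pub fn stop_order(after: &BTreeMap<&'static str, Vec<&'static str>>) -> Vec<&'static str> {
    let mut order = start_order(after);
    order.reverse();
    order
}
```

With `startos-shutdown.service` declared `After=startd.service`, the model puts the barrier unit first in the stop order, so its stop action runs while startd is still up.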

  • core/startos-shutdown.service (poweroff/halt) and core/startos-restart.service (reboot/kexec). DefaultDependencies=no binds each to its specific target (not the generic shutdown.target), so each fires only on its own mode. Ordered After=startd.service (and Before=/Conflicts= its target), each unit's ExecStop calls start-cli server shutdown / restart — which now waits for graceful teardown — while startd is still up. start-cli authenticates locally via /run/startos/rpc.authcookie. startd then re-issues the matching final poweroff/reboot, which is idempotent since systemd is already on its way there.

  • server shutdown / restart gain a wait (core/src/shutdown.rs): a watch-backed completion signal on RpcContext (wait_closed(), fired after services.shutdown_all()). Default false over the API (the frontend keeps its immediate reply — it already sends {}), default true on the CLI with --nowait to opt out.
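A minimal sketch of the completion signal described above, for readers who want to see the shape of it. It assumes a std `Mutex` + `Condvar` in place of the watch channel the PR uses, so it is not the actual `RpcContext` code. `Closed`, `mark_closed`, `is_closed` and `shutdown` are hypothetical names; only `wait_closed()` comes from the PR.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// One-shot "teardown finished" flag that any number of callers can wait on.
#[derive(Clone, Default)]
pub struct Closed {
    inner: Arc<(Mutex<bool>, Condvar)>,
}

impl Closed {
    pub fn new() -> Self {
        Self::default()
    }

    /// Mark teardown as complete and wake every waiter.
    pub fn mark_closed(&self) {
        let (flag, cv) = &*self.inner;
        *flag.lock().unwrap() = true;
        cv.notify_all();
    }

    /// Block until `mark_closed` has run; returns at once if it already has.
    pub fn wait_closed(&self) {
        let (flag, cv) = &*self.inner;
        let mut closed = flag.lock().unwrap();
        while !*closed {
            closed = cv.wait(closed).unwrap();
        }
    }

    pub fn is_closed(&self) -> bool {
        *self.inner.0.lock().unwrap()
    }
}

/// Hypothetical handler: run `teardown` in the background, fire the signal
/// once it finishes, and hold the reply only when `wait` is set.
pub fn shutdown(
    closed: &Closed,
    wait: bool,
    teardown: impl FnOnce() + Send + 'static,
) -> thread::JoinHandle<()> {
    let signal = closed.clone();
    let handle = thread::spawn(move || {
        teardown();
        signal.mark_closed();
    });
    if wait {
        closed.wait_closed();
    }
    handle
}
```

With `wait = false` the call returns immediately, matching the frontend's immediate reply; with `wait = true` it returns only after teardown has finished, which is what the barrier units rely on.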

Files

  • core/src/shutdown.rs: ShutdownParams { wait }, wait-on-teardown.
  • core/src/context/rpc.rs: closed watch + wait_closed().
  • core/startos-shutdown.service, core/startos-restart.service: barrier units.
  • Makefile, debian/startos/postinst: install + enable.
  • core/locales/i18n.yaml: help.arg.nowait (×5 locales).
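The differing defaults for `wait` can be sketched as follows. This is an illustration under assumptions, not the code in core/src/shutdown.rs: the field name `wait` and the `--nowait` flag come from the PR, while `from_api` and `from_cli` are hypothetical constructors invented for this example.

```rust
/// Simplified stand-in for the PR's `ShutdownParams { wait }`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ShutdownParams {
    pub wait: bool,
}

impl ShutdownParams {
    /// API path: an omitted `wait` (e.g. the frontend's `{}`) means "do not wait".
    pub fn from_api(wait: Option<bool>) -> Self {
        Self { wait: wait.unwrap_or(false) }
    }

    /// CLI path: waits by default; `--nowait` opts out.
    pub fn from_cli(nowait: bool) -> Self {
        Self { wait: !nowait }
    }
}
```

Keeping the API default at `false` means existing frontend calls behave as before, while the barrier units get the blocking behavior simply by invoking the CLI without extra flags.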

Validation status

  • Not yet build- or runtime-verified. TODO before merge: cargo check, make ts-bindings (generate ShutdownParams.ts + SDK rebuild), cross-layer typechecks, and a VM test of an external poweroff and reboot on beta-9 (diff teardown logs against a UI shutdown). Pushed for review per @dr-bonez.

@dr-bonez

Member

is StartOS's existing SIGTERM handling already sufficient for this use case?

@helix-nine

Contributor Author

Good question — and digging in, you're right to push on it. startd already handles SIGTERM.

In core/src/bins/startd.rs the signal task selects on interrupt()/quit()/terminate(), and on any of them does shutdown.send(None), which drives RpcContext::shutdown() → services.shutdown_all() → per-container lxc-stop (+ unmount/destroy). That's the same graceful teardown as a UI/start-cli shutdown. The only difference is that the None path skips Shutdown::execute() (the LVM export() + poweroff/reboot), which is correct on an external shutdown since systemd owns the host poweroff.
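To make the two paths concrete, here is a rough model of what each one runs. It is a sketch under assumptions: the `Shutdown` enum and `shutdown_steps` are hypothetical simplifications for illustration, and the step strings are just labels for the calls named above, not the real startd control flow.

```rust
/// Hypothetical stand-in for the real `Shutdown` request type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Shutdown {
    Poweroff,
    Restart,
}

/// Steps performed for a shutdown request.
/// `None` models the signal path (SIGINT/SIGQUIT/SIGTERM);
/// `Some(_)` models a UI or start-cli initiated shutdown.
pub fn shutdown_steps(request: Option<Shutdown>) -> Vec<&'static str> {
    // Both paths run the graceful container teardown.
    let mut steps = vec!["services.shutdown_all()"];
    if let Some(kind) = request {
        // Only an explicit request goes on to export LVM and power the host.
        steps.push("Shutdown::execute(): LVM export()");
        steps.push(match kind {
            Shutdown::Poweroff => "poweroff",
            Shutdown::Restart => "reboot",
        });
    }
    steps
}
```

In this model the signal path ends after the container teardown and leaves the host poweroff to systemd.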

So the graceful path is already wired on SIGTERM. Re-evaluating this PR against that, most of it is redundant:

  • Before=shutdown.target reboot.target poweroff.target: DefaultDependencies=yes already injects Before=shutdown.target + Conflicts=shutdown.target for both .service units. No-op.
  • KillMode=mixed: the LXC monitor/payload processes live under services.slice (per lxc.cgroup.dir.* in config.template), not startd's cgroup, so control-group vs mixed doesn't change whether containers get killed. Negligible.
  • TimeoutStopSec=120: the only knob that can actually matter, and only if services.shutdown_all() needs more than the default 90s.

So at the startd level the existing handling looks sufficient. That also makes me distrust the #3235 root-cause writeup — it claims the units "lack Before=shutdown.target", which isn't true for a .service. The reporter did observe real abrupt termination though, so something defeats the graceful path in practice. My leading suspicion is teardown overrunning the default 90s stop timeout: lxc-stop is invoked per-container with no --timeout, so a slow/serialized teardown across several services can blow past 90s → systemd SIGKILLs startd mid-shutdown → the not-yet-stopped containers die abruptly. The other candidate is the per-service subcontainer SIGKILL fallback the reporter flagged (subcontainer/sync.rs), which is internal to a service's own stop and wouldn't be touched by any unit-file change.
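The timeout suspicion is simple arithmetic, sketched here. The per-container durations used in the note below are made-up numbers for illustration, not measurements; `serialized_teardown` and `overruns` are hypothetical helpers, and the model assumes containers are stopped strictly one after another.

```rust
use std::time::Duration;

/// Total teardown time if containers are stopped one after another.
pub fn serialized_teardown(per_container: &[Duration]) -> Duration {
    per_container.iter().sum()
}

/// True if a serialized teardown would exceed the unit's stop timeout,
/// at which point systemd would SIGKILL the service mid-shutdown.
pub fn overruns(per_container: &[Duration], stop_timeout: Duration) -> bool {
    serialized_teardown(per_container) > stop_timeout
}
```

For example, four containers that each take 30s to stop add up to 120s, which overruns the default 90s but would just fit within TimeoutStopSec=120.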

Given that, I don't want to merge a mostly-cosmetic PR. I'd rather reproduce first: spin a startos-beta-9 VM, install a service, trigger an ACPI/systemctl poweroff external shutdown, and diff the teardown logs against a UI shutdown to see exactly where it diverges. Want me to do that before changing anything — and if it turns out to be the timeout, narrow this PR to just TimeoutStopSec (or fix the real overrun) instead of the full drop-in?

Adds an opt-in wait that blocks until graceful teardown (services.shutdown_all) completes, via a watch-backed completion signal on RpcContext. Default false over the API (frontend keeps its immediate reply); the CLI will default it true with --nowait. Groundwork for a systemd pre-shutdown barrier so external shutdowns (UPS/qm/ACPI) tear down containers gracefully.

Two systemd units (startos-shutdown.service / startos-restart.service) with DefaultDependencies=no so each binds to poweroff/halt vs reboot/kexec specifically. Ordered After=startd.service (Before the respective target), their ExecStop calls start-cli server shutdown/restart, which waits for graceful container teardown before systemd proceeds. Authenticates locally via the rpc authcookie. Makes externally-initiated shutdowns (UPS/qm/ACPI) graceful.
@helix-nine helix-nine force-pushed the fix/graceful-external-shutdown branch from 33bccf2 to 6177d02 Compare June 15, 2026 22:46
@helix-nine helix-nine changed the title from "fix: graceful container shutdown on external (qm/ACPI/UPS) shutdown" to "fix: graceful shutdown on external (UPS/qm/ACPI) shutdown" Jun 15, 2026
@dr-bonez dr-bonez merged commit 5f9665f into master Jun 15, 2026
16 checks passed
@dr-bonez dr-bonez deleted the fix/graceful-external-shutdown branch June 15, 2026 22:58
@helix-nine

Contributor Author

Thanks for the review and merge. Since this went in ahead of a local build, I'm watching the master CI now (Automated Tests + the compile/Debian-package matrix) and will fix forward immediately if anything trips.

One tracked follow-up: the new #[ts(export)] ShutdownParams binding (sdk/base/lib/osBindings/ShutdownParams.ts) hasn't been regenerated — run-tests.sh skips the export_* tests so CI won't flag it, and nothing imports it yet, so it's harmless, but I'll run make ts-bindings to land it for consistency when I next have a warm build env. Likewise the beta-9 VM test (external poweroff + reboot vs UI shutdown) is still worth doing as a post-merge confirmation — happy to run it if you free a slot.

Development

Successfully merging this pull request may close these issues.

Proxmox - StartOS 0.4.0-beta.9 VM- qm, ACPI, or UPS-triggered VM shutdown does not terminate services gracefully

2 participants