draug

A small, cgroup-aware process supervisor in Rust — built for containers — that restarts the target gracefully, before the OOM killer does.

draug [options] -- <command> [args...]

The problem it solves

Long-running processes — a script, a server, a worker, a daemon, anything that stays up for hours — tend to leak memory over time. The standard mitigation is recycling: restart on a timer and restart before the process hits its memory limit. On a bare host or VM, mature tools already do this well — but inside a container they fall apart, for two reasons:

Generic tools measure the wrong number. psutil, monit, and circus read host-level statistics (/proc/meminfo and friends), not the cgroup the container actually lives in. Under a container memory limit their numbers are simply wrong — the only correct source of truth is the cgroup itself.
The OOM killer is brutal. When a cgroup hits its limit the kernel sends SIGKILL — no chance to finish in-flight work, flush state, or drain connections, leaving jobs half-done and data inconsistent. systemd-oomd / oomd can act earlier, but they also SIGKILL — and they aren't available on AWS ECS/Fargate at all. That is exactly where this problem bites hardest: a managed container platform with a hard memory limit but no built-in graceful pre-OOM mechanism. draug is built for precisely that gap.

Why draug exists

draug was built to do the one thing those tools can't: read the cgroup directly and act before the kernel would OOM-kill the target, restarting it the gentle way — SIGTERM → grace period → SIGKILL — so the application gets a real chance to shut down cleanly. The graceful part (finishing jobs, closing resources) lives in the target's own signal handler; draug decides when to restart, sends the signal, and escalates to SIGKILL only if the target hangs.

It is deliberately tiny: a single static binary, no interpreter, no GC, one supervised target per container. It does one job — supervise your process — and leaves the PID 1 duties (reaping orphans, forwarding signals) to a minimal init like tini or docker run --init, which most container setups already run anyway. draug works the same whether or not one is present; an init just keeps a misbehaving target from leaking zombie processes.

The name: draug is guard reversed (a nod to tini = init reversed), and Old Norse draugr — the undead that returns from the grave. Fitting for a supervisor whose whole job is resurrecting its target.

What it does

draug restarts the target gracefully on any of:

a periodic timer (--restart-interval) — flush slow leaks proactively;
a cgroup memory threshold (--mem-threshold, memory.current / memory.max);
PSI memory pressure (memory.pressure, event-driven, with a polling fallback);
a stale heartbeat (--heartbeat-file) — catches alive-but-stuck targets;

and it backs off and gives up after repeated crash-loops. CPU is monitored for diagnostics/alerts only, never as a restart trigger.

Install

Linux only (x86_64 / aarch64). Prebuilt static (musl) and glibc binaries are attached to every GitHub Release.

One-line install (picks the right binary for your platform):

curl --proto '=https' --tlsv1.2 -LsSf \
  https://github.com/dvazar/draug/releases/latest/download/draug-installer.sh | sh

In a container, copy a static binary straight into your image:

COPY --from=builder /draug /usr/local/bin/draug

Or download a binary manually from the Releases page.

How it works

A single synchronous epoll loop drives everything, over four fds:

fd	source	role
`signalfd`	`SIGTERM`/`SIGINT`/`SIGCHLD` (blocked, read via fd)	`TERM` from PID 1 / the orchestrator → graceful shutdown + exit; `SIGCHLD` → reap
tick `timerfd`	a 1–2 s tick	read `memory.current/max`, heartbeat age, periodic deadline
action `timerfd`	a deferred one-shot	the active deadline: grace period, `SIGKILL`-confirm, or respawn
`psi_fd`	`memory.pressure` trigger	`EPOLLPRI` = pressure crossed the threshold

All trigger logic lives in a pure decision core (decision::evaluate) that takes samples + state and returns a decision — no I/O, so it is exhaustively table-tested. Precedence is Crash > HeartbeatStale > Psi > Memory > Periodic, and heartbeat/psi/memory are gated by a configurable startup grace. Both PSI modes (event and poll) flow through this same core, so grace and precedence have a single source of truth.

The control logic is a second pure state machine (fsm::step): given the current state, an event, and the latest samples, it returns the next state and a list of side-effect actions. A thin I/O shell owns the epoll loop, the child, and the fds; each wakeup it reaps first (so a crash is never mislabeled as a graceful restart), turns fd readiness into events, runs fsm::step, and executes the returned actions (spawn, signal, snapshot, alert, log). The states — Running → Draining → Killing → Backoff → Respawning — distinguish a restart (reap, then respawn) from a shutdown (drain, then exit), so an operator's SIGTERM stops the container cleanly while an internal trigger restarts the target. A hung target is escalated to SIGKILL; if it survives even that within a bounded confirm window, draug exits non-zero rather than hanging forever. This split — a pure core plus an I/O shell — makes the whole lifecycle policy host-testable without spawning real processes.

Build

cargo build --release          # binary at target/release/draug

Requires Rust (edition 2024). Linux for the real thing (cgroups, PSI, epoll); the pure/parser layers build and test on macOS too.

Usage

Put your command after --:

draug \
    --restart-interval 30m \
    --mem-threshold 0.85 \
    --grace-period 90s \
    --heartbeat-file /run/draug/hb \
    -- <command> [args...]

Run it under a minimal init so PID 1 reaps orphans and forwards signals — tini -g -- draug …, or simply start the container with docker run --init (which gives you tini). Most orchestrators already run an init as PID 1, so often there is nothing extra to add.

Disable any trigger individually: --restart-interval 0, --mem-threshold 0, --psi-trigger "" (PSI is disabled by an empty string, not 0). So one binary covers everything from "timer only" to "timer + memory + PSI + heartbeat".

Configuration

CLI flags (the target command follows --):

Flag	Default	Purpose
`--restart-interval <dur>`	`30m`	Periodic graceful restart. `0` = off
`--mem-threshold <ratio>`	`0.85`	Restart when `memory.current/memory.max >= ratio`. `0` = off
`--psi-trigger <stall/win>`	`150000/1000000`	PSI `some` threshold in µs. Empty = off; requires `stall <= window`
`--graceful-signal <T\|I>`	`TERM`	Graceful stop signal (`TERM` or `INT`)
`--grace-period <dur>`	`90s`	How long to wait before `SIGKILL` (must be `> 0`)
`--tick <dur>`	`2s`	Memory/heartbeat poll period (must be `> 0`)
`--heartbeat-file <path>`	— (off)	Heartbeat file; target updates its mtime, draug reads `now - mtime`
`--heartbeat-max-age <dur>`	`60s`	Staleness threshold (must be `> 0` when a heartbeat file is set)
`--startup-grace <dur>`	`15s`	Suppress heartbeat/triggers for the first N s (must be `> 0`)
`--max-failures <n>`	`3`	Consecutive failed starts → alert + exit (must be `>= 1`)
`--backoff <dur>`	`5s`	Backoff base (grows linearly with the failure streak)
`--cgroup-root <path>`	`/sys/fs/cgroup`	Override the cgroup root (testing)
`--log-level <level>`	`info`	Min level to emit: `error`/`warn`/`info`/`debug`. Env `DRAUG_LOG_LEVEL`
`--log-timestamps`	off	Prefix each line with an RFC3339 UTC timestamp. Env `DRAUG_LOG_TIMESTAMPS`

Durations accept ms/s/m/h (a bare number means seconds). Zero values for the protection knobs (--startup-grace, --heartbeat-max-age with a heartbeat file, --max-failures, --tick, --grace-period, --backoff) are rejected, because they would silently defeat the very protection they configure.

Environment

Variable	Purpose
`DRAUG_WEBHOOK_URL`	POST anomaly alerts as JSON
`DRAUG_SERVICE`	Service name tag in the alert payload
`DRAUG_ENV`	Environment tag (`dev`/`staging`/`prod`)
`DRAUG_HEARTBEAT_FILE`	Passed to the target so it knows where to write heartbeats
`DRAUG_LOG_LEVEL`	Min log level (`error`/`warn`/`info`/`debug`); `--log-level` overrides it
`DRAUG_LOG_TIMESTAMPS`	Truthy (`1`/`true`/`yes`/`on`) enables RFC3339 timestamps (or `--log-timestamps`)

Alerting

Alerts fire on anomalies only — SIGKILL escalation (a hung target), crash-loop give-up, and memory/PSI restarts (a growing leak). Periodic restarts and single recovered crashes are logged, not alerted. Each payload carries the restart reason, a lifetime restart count, and a diagnostic snapshot (memory/CPU/threads/fds/heartbeat age) captured while the target is still alive.

Delivery runs on a background worker with a severity-aware bounded queue that never blocks the event loop. Under saturation, Critical alerts (escalation, crash-loop) are never silently dropped — a new Critical evicts the oldest Warning to make room; only Warnings are shed (counted, with rate-limited logging). If the worker ever dies, that is surfaced once rather than silently black-holing alerts.

Logging

draug logs one structured event per line to stderr in logfmt (key=value), so collectors (CloudWatch Logs Insights, Loki) parse it without a custom pattern while it stays readable in docker logs:

draug: level=info event=spawned target="<command> [args...]" pid=1234 heartbeat=/run/draug/hb
draug: level=warn event=restart target="..." pid=1234 reason=HeartbeatStale escalated=false restarts=3
draug: level=error event=crash-loop target="..." pid=1234 failures=3 restarts=2 action=giving-up

Every lifecycle line carries the target's identity — its command line and the heartbeat path it watches — plus the child pid. So when several draug-supervised targets run side by side (or N replicas of one), you can tell which one (re)started or crashed instead of reading an indistinguishable line.

--log-level sets the minimum severity emitted (error < warn < info < debug): info (the default) shows lifecycle events; warn keeps only restarts, crashes, and errors; debug adds internal diagnostics. Timestamps are off by default — CloudWatch and journald already stamp each line — and turned on with --log-timestamps for a bare docker logs with no collector in front.

Key events: spawned (info, every spawn/respawn, with the new pid), restart (warn, with reason/escalated/restarts), crash (warn), crash-loop (error). The restarts= field is the running count of restarts completed so far (the same monotonic value on the restart and crash-loop lines). The crash-loop give-up line and its webhook payload (see Alerting) quote that same number; an anomaly alert (memory/PSI) instead counts the in-progress restart, so its restart_count reads one higher than the stderr restarts=.

Guarantees & design notes

cgroup-accurate. Reads memory.current / memory.max from the cgroup, auto-detecting v1/v2 — not host /proc stats.
Graceful first, forceful only if needed. SIGTERM → grace → SIGKILL, signalling the whole target process group; a clean exit at the grace boundary is recognized and not mislabeled as a hang.
EINTR-safe reaping. A transient interrupted waitpid is retried, so a clean exit is reliably observed and shutdown never stalls.
Crash-loop backoff. Linear backoff; after --max-failures consecutive failed starts draug exits non-zero so PID 1 tears down the container and the orchestrator (e.g. ECS) restarts the task — the last-resort safety net.
Graceful degradation. Missing cgroup files, an unreadable memory limit, or no PSI write access disable the relevant trigger (logged once) rather than crashing — the timer and heartbeat keep working. A PSI trigger that only becomes available after a slow startup is picked up automatically (bounded retry); a PSI fd that errors or hangs up at runtime disables PSI (until the next restart) instead of busy-looping.

Development

Pure / parser / POSIX tests (macOS or Linux): cargo test
Full suite incl. the Linux-only supervisor + lifecycle matrix: ./scripts/test-linux.sh
Lint/format: cargo fmt and cargo clippy --all-targets --features _test_support -- -D warnings

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github/workflows		.github/workflows
scripts		scripts
src		src
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Dockerfile.test		Dockerfile.test
LICENSE		LICENSE
README.md		README.md
codecov.yml		codecov.yml
dist-workspace.toml		dist-workspace.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

draug

The problem it solves

Why draug exists

What it does

Install

How it works

Build

Usage

Configuration

Environment

Alerting

Logging

Guarantees & design notes

Development

About

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

draug

The problem it solves

Why draug exists

What it does

Install

How it works

Build

Usage

Configuration

Environment

Alerting

Logging

Guarantees & design notes

Development

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages