Containerise - run spikes on outer and inner containers by neuromaxer · Pull Request #4 · appx-org/agent-server

neuromaxer · 2026-06-11T19:56:39Z

Stage 0 spike: proving rootless Podman works inside an unprivileged Docker container

What this is

A spike (investigation, not production code) validating the core assumption behind our containerised-apps architecture: that we can run a Docker "outer" container that itself runs Podman to build and serve user apps — without giving it root or privileged access to the host.

Result: it works, securely and fast. This PR captures the proven recipe + a detailed writeup so we can build Stage 2 (agent-server in the container) with confidence.

Nothing here ships yet — it's all under container/ (throwaway test rig) and docs/.

TL;DR of what we found

✅ Full chain works: host → Docker → rootless Podman → inner app, reachable via published port
✅ No --privileged, no dangerous capabilities, runs as a non-root user
✅ Zero host changes needed beyond "install Docker" — Ubuntu's strict security defaults stay on
✅ Fast: native overlay storage (~2× faster than the fallback), sub-second startup
✅ Restart recovery works (apps come back after a container restart)

The real blocker turned out not to be AppArmor (what we expected) but a subtle setuid packaging issue with Ubuntu's newuidmap helper — fixed in 3 Dockerfile lines (the same trick the official Podman image uses). Full root-cause trace is in the findings doc.

What to review (in priority order)

container/SPIKE-FINDINGS.md — start here. The complete writeup: the headline fix, every Docker flag justified, security analysis, benchmarks, restart behaviour. Reads top-to-bottom.
container/Dockerfile + container/run-outer.sh — the actual proven artifacts. The 4 required run flags are each documented inline with "what breaks if you remove it."
container/seccomp-builder.json (+ gen-seccomp.sh) — our tailored syscall filter. Note it's stricter than just disabling seccomp — only 3 specific syscalls were unblocked.
container/INNER-APP-SPIKE.md - inner (app) container spike findings

The rest (smoke.sh, entrypoint.sh, the docs/plans/* and docs/architecture/other/* files) is supporting context — the test harness and the multi-stage rollout plan this spike is Stage 0 of.

Not in scope (deliberately)

agent-server integration, the deploy skill, appx wiring — all later stages. This PR only answers "is the nesting safe and viable?" → yes.

…check) Root cause traced with bpftrace: Ubuntu's setuid-root newuidmap runs euid=0 inside the unprivileged outer container, which bypasses the kernel uid_map ownership shortcut and demands CAP_SYS_ADMIN in the init userns (denied by docker's cap bounding set). Switching to file caps (cap_setuid+ep) keeps euid 1000 == nested-userns owner, matching Fedora/podman-stable. Also pre-create volume mountpoints owned by builder so named volumes don't mount root-owned.

- add --device /dev/net/tun (slirp4netns rootless networking needs it)
- clear default_sysctls in containers.conf (crun couldn't write read-only /proc/sys/net/ipv4/ping_group_range under the nested netns)
- add --security-opt systempaths=unconfined (docker masks /proc submounts; kernel mount_too_revealing() then blocks the inner container's fresh proc mount with 'mount proc to proc: Operation not permitted'). No privilege/caps.
- entrypoint wipes stale XDG_RUNTIME_DIR pause-process state on each boot so podman survives docker restart

Deletion-tested all four security flags; each is required with a distinct failure (recorded in findings). Replaced seccomp=unconfined with a tailored profile (container/seccomp-builder.json, gen-seccomp.sh documents provenance): podman's stock profile + ungating only sethostname/setdomainname/setns, while keeping bpf/perf_event_open/quotactl/fanotify_init/lookup_dcookie denied. Strictly tighter than unconfined. Smoke still 11/11.

- Tested podman as outer runtime (installed podman 5.7 on host, recorded): rootless podman-outer fails (rootless-in-rootless subuid exhaustion at newuidmap); rootful podman-outer works with a SMALLER flag set (no seccomp override - podman default allows mount; no systempaths) but needs root + DNS config. docker-outer remains the proven path.
- Filled in all findings sections: flag table with exact deletion errors, host-prereqs (none beyond docker), host-changes log (all sysctls restored to hardened defaults), T4 storage, T3 warmup+restart, port chain, Stage 2 recommendations.
- Final clean-slate smoke (after docker system prune -af): 11/11 PASS, exit 0.

Canvinus

Review — Stage 0 nested-podman spike

I reviewed this by independently reproducing the spike on a clean host, not just reading the diff. I stood up a throwaway Hetzner Ubuntu 24.04 box (kernel 6.8, AppArmor on, apparmor_restrict_unprivileged_userns=1) — the exact production target the findings doc says it couldn't use — plus an arm64 Ubuntu 24.04 VM for cross-arch coverage. Installed Docker 29.5.3 on both and ran the full chain.

Verdict

The spike is sound and its conclusion holds. The unprivileged docker → rootless-podman → inner-app chain works on the real production OS, on both x86_64 and arm64, with zero host changes and the hardened userns sysctl left at 1. Nothing ships from this PR (everything is under container/ + docs/), so merge risk is ~zero. Recommend merge after the one should-fix below (a one-line gen-seccomp.sh correctness bug); everything else is forward-looking and can ride into Stage 2.

What I verified (independent reproduction)

| Check | x86_64 / 24.04 (Hetzner) | arm64 / 24.04 (OrbStack) |
|---|---|---|
| `smoke.sh` full chain | 11/11 | 11/11 |
| `apparmor=unconfined` actually required | ✓ deletion test reproduces the exact overlay-mount denial | n/a (OrbStack kernel has no AppArmor LSM) |
| Hardened userns sysctl stays `=1` through green runs | ✓ | n/a |
| uid isolation: inner "root" → host uid 1000, worker → 100100 | ✓ | ✓ |

This closes the doc's own open item ("re-verify on a genuine Ubuntu 24.04 host before production") — done, it passes.
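
The uid-isolation row can be read as a plain uid_map lookup. The sketch below is illustrative only: the two map entries (a one-uid mapping onto host uid 1000 plus a subordinate range starting at 100000) and the worker's in-container uid of 101 are assumptions inferred from the numbers in the table, not values quoted from the spike, and `translate` is a hypothetical helper.

```python
# Each entry mirrors one line of /proc/<pid>/uid_map:
# (first uid inside the namespace, first uid outside, length of the range).
# ASSUMPTION: these entries are inferred for illustration, not taken from the spike.
UID_MAP = [
    (0, 1000, 1),        # inner "root" is the unprivileged builder user outside
    (1, 100000, 65536),  # every other inner uid lands in the subordinate range
]


def translate(uid, uid_map):
    """Return the outside uid for an inside uid, or None if it is unmapped."""
    for inside, outside, count in uid_map:
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    return None


if __name__ == "__main__":
    print(translate(0, UID_MAP))    # inner root
    print(translate(101, UID_MAP))  # hypothetical "worker" uid inside the container
```

Under these assumed entries, inner uid 0 resolves to 1000 and inner uid 101 resolves to 100100, which is the shape of result the table reports; no inner uid resolves to host uid 0.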

Strengths

The headline newuidmap root-cause (Dockerfile L31–33) is correct and matches upstream quay.io/podman/stable (rpm --setcaps shadow-utils). It was bpftrace-traced, not guessed.
Every docker run flag is deletion-tested with the exact failure recorded — I reproduced all four independently.
Storage-on-a-volume is load-bearing and correctly chosen: without it, podman's overlay store lands on the container's own overlay rootfs → overlay-on-overlay, which the kernel forbids (torvalds/linux@76bc8e2). I confirmed the broken fuse-overlayfs fallback empirically.

Findings by severity

🔴 Should-fix (1) — gen-seccomp.sh leaves a contradictory ALLOW+ERRNO rule pair for the same three syscalls (inline comment, with one-line fix).

🟡 Recommend for Stage 2 (forward-looking, all tested)

Add --init — I found live zombies (slirp4netns, conmon) in the running container (inline comment has the evidence).
Add resource limits + restart policy — container currently runs uncapped with RestartPolicy=no (inline comment).
Replace apparmor=unconfined with a tailored profile — I built and validated one (inline comment has the full profile; passes the chain in enforce mode while still denying /proc/sysrq-trigger & /proc/kcore). Closes the deferred TODO at run-outer.sh L23.
Optional: drop capabilities to a minimal set — verified the chain works with --cap-drop=ALL + 11 caps vs. the full default 14.

⚪ Nits

Duplicate "Result summary" heading + leftover template stub in SPIKE-FINDINGS.md (inline).
Docs land in docs/plans/ while main now uses docs/superpowers/plans/ (post-refactor convention).

Combined-hardening proof

I ran all the recommendations together in one container — fixed seccomp + tailored AppArmor (enforcing) + --cap-drop=ALL+minimal caps + --init + --memory=3g --cpus=2 --pids-limit=2048 + --restart=unless-stopped — and re-ran the full chain incl. restart recovery:

COMBINED HARDENING: 15 passed, 0 failed
  pid1 is docker-init (reaper) · apparmor podman-builder (enforce) · CapBnd=880404fb
  podman info/run/build/port-chain · zombie count = 0 · sysrq-trigger+kcore DENIED
  workspace+image-store survive restart · podman start recovers inner app

So none of the hardening conflicts with the working setup — it can be adopted wholesale in Stage 2.
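
For readers who want to see which capabilities a mask such as CapBnd=880404fb stands for, here is a small decoding sketch. The bit numbering follows the kernel's capability constants; the comparison mask a80425fb is the value commonly reported for Docker's default capability set and is an assumption here, not a figure from this review. `decode_caps` is a hypothetical helper name.

```python
# Capability names indexed by bit number, as defined in linux/capability.h.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG",
    "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL",
    "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG",
    "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON",
    "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask):
    """Turn a CapBnd-style hex string into the list of capability names it sets."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


if __name__ == "__main__":
    hardened = decode_caps("880404fb")   # mask reported in the combined run
    default = decode_caps("a80425fb")    # ASSUMPTION: typical Docker default mask
    print(len(hardened), hardened)
    print(sorted(set(default) - set(hardened)))
```

The reported mask decodes to 11 capabilities and does not include CAP_SYS_ADMIN. If the assumed default mask is right, the three dropped relative to it are CAP_SETPCAP, CAP_NET_RAW and CAP_AUDIT_WRITE.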

Best-practice cross-check (sources)

Confirmed the recipe against: Dan Walsh — "Podman inside a container", the upstream containers/image_build podman image, the moby docker-default AppArmor template, the kernel no_new_privs & overlayfs docs, capabilities(7), and podman v5.0 release notes. All consistent with the spike. One forward note: podman 5.x switches the rootless default slirp4netns→pasta (both still need /dev/net/tun), so a base-image bump warrants a re-smoke.

Canvinus · 2026-06-11T21:28:33Z

+for s in d['syscalls']:
+    inc=s.get('includes',{})
+    if s['action']=='SCMP_ACT_ALLOW' and inc.get('caps')==['CAP_SYS_ADMIN']:
+        s['names']=[n for n in s['names'] if n in NEED]


🔴 Should-fix — a contradictory seccomp rule survives generation.

This loop rewrites only the SCMP_ACT_ALLOW rule, but podman's stock profile also ships a complementary SCMP_ACT_ERRNO rule (excludes.caps=[CAP_SYS_ADMIN]) that also lists sethostname/setdomainname/setns. After this script runs, seccomp-builder.json contains both an ALLOW rule and an ERRNO rule naming those three syscalls. I confirmed this in the generated file — they appear in two separate rules.

Which rule wins on conflicting input is libseccomp/runtime-version-defined, not guaranteed. It happens to resolve ALLOW-wins on Docker 29.5.3 (I verified the chain works on both x86_64 and arm64), but a runtime/libseccomp bump could flip it and break inner-container setup with a confusing Operation not permitted. Since this script's whole job is provenance/correctness, worth fixing.

One-line fix — also strip the names from the deny rule:

for s in d['syscalls']:
    inc=s.get('includes',{})
    if s['action']=='SCMP_ACT_ALLOW' and inc.get('caps')==['CAP_SYS_ADMIN']:
        s['names']=[n for n in s['names'] if n in NEED]
        s.pop('includes',None)
    # also remove them from the complementary ERRNO rule so they aren't both allowed and denied
    if s['action']=='SCMP_ACT_ERRNO' and s.get('excludes',{}).get('caps')==['CAP_SYS_ADMIN']:
        s['names']=[n for n in s['names'] if n not in NEED]

Verified on the box: with this applied, the full nested chain still passes (inner podman run exercises exactly these three syscalls during namespace setup).
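
A regression check for this class of bug could scan the generated profile for names that are allowed unconditionally while still listed in a deny rule. The following is a minimal sketch under stated assumptions, not part of gen-seccomp.sh: `find_conflicts` and `strip_from_deny` are hypothetical helpers, `SAMPLE` is a made-up miniature that only imitates the shape of the stock profile, and treating any rule with `includes`/`excludes` as "conditional" is a simplification of how the runtime filters rules.

```python
import copy

NEED = {"sethostname", "setdomainname", "setns"}

# ASSUMPTION: a tiny stand-in for the generated profile, not the real file.
SAMPLE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
        # The ALLOW rule after its CAP_SYS_ADMIN gate was removed.
        {"names": ["sethostname", "setdomainname", "setns"],
         "action": "SCMP_ACT_ALLOW"},
        # The complementary deny rule that the original loop left untouched.
        {"names": ["mount", "sethostname", "setdomainname", "setns"],
         "action": "SCMP_ACT_ERRNO",
         "excludes": {"caps": ["CAP_SYS_ADMIN"]}},
    ],
}


def find_conflicts(profile):
    """Names allowed unconditionally that also appear in some ERRNO rule."""
    allowed, denied = set(), set()
    for rule in profile.get("syscalls", []):
        names = rule.get("names", [])
        conditional = bool(rule.get("includes")) or bool(rule.get("excludes"))
        if rule.get("action") == "SCMP_ACT_ALLOW" and not conditional:
            allowed.update(names)
        elif rule.get("action") == "SCMP_ACT_ERRNO":
            denied.update(names)
    return sorted(allowed & denied)


def strip_from_deny(profile, names):
    """Copy the profile and drop `names` from the CAP_SYS_ADMIN-gated ERRNO rule."""
    fixed = copy.deepcopy(profile)
    for rule in fixed["syscalls"]:
        gated = rule.get("excludes", {}).get("caps") == ["CAP_SYS_ADMIN"]
        if rule["action"] == "SCMP_ACT_ERRNO" and gated:
            rule["names"] = [n for n in rule["names"] if n not in names]
    return fixed


if __name__ == "__main__":
    print(find_conflicts(SAMPLE))                         # the three ungated syscalls
    print(find_conflicts(strip_from_deny(SAMPLE, NEED)))  # empty once the fix is applied
```

Running a check like this after generation would fail loudly if a future stock-profile update reintroduces the overlap, instead of relying on how the runtime happens to order rules.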

fixed in 572acf0

Canvinus · 2026-06-11T21:28:33Z

+docker build -t "$IMAGE" .
+docker rm -f "$NAME" 2>/dev/null || true
+
+docker run -d --name "$NAME" \


🟡 Recommend (Stage 2): add --init, resource limits, and a restart policy.

Zombies (real; I found them in the running container). With CMD ["sleep","infinity"] (Dockerfile L77) as PID 1, the live container accumulates zombies:

$ docker exec builder-outer ps -eo pid,stat,comm | awk '$2 ~ /Z/'
   65 Z    slirp4netns
   68 Zs   conmon

PID 1 is sleep, which reaps nothing and catches no signals (/proc/1/status → SigCgt=0, so docker stop always hits the 10 s SIGKILL timeout). podman detaches a conmon per inner container that reparents to PID 1. docker run --init (tini) fixes both — PID 1 becomes docker-init and zombie count drops to 0 (tested). This matters more at Stage 2: Node-as-PID-1 (agent-server) has the same non-reaping problem.

Limits + restart. docker inspect shows Memory=0, NanoCpus=0, PidsLimit=null, RestartPolicy=no — a runaway inner build/fork starves the host and every other project, and the outer container won't return after a host reboot. Suggest:

--init --memory=3g --cpus=2 --pids-limit=2048 --restart=unless-stopped

The plan defers these to Stage 4, but they're free to add now and I verified they don't disturb the chain (combined run = 15/15).
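
If the zombie check is worth keeping as a smoke assertion, the awk filter above could be mirrored in a few lines. This is a sketch only: `find_zombies` is a hypothetical helper, and in `SAMPLE_PS` only the two zombie rows come from the output quoted above, while the header and the `sleep` row are assumed for illustration.

```python
# ASSUMPTION: header and PID 1 row are illustrative; the Z rows are from the review.
SAMPLE_PS = """\
  PID STAT COMMAND
    1 Ss   sleep
   65 Z    slirp4netns
   68 Zs   conmon
"""


def find_zombies(ps_output):
    """Return (pid, command) for every row whose STAT column contains 'Z'."""
    zombies = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header and malformed rows; keep only numeric pids.
        if len(parts) == 3 and parts[0].isdigit() and "Z" in parts[1]:
            zombies.append((int(parts[0]), parts[2].strip()))
    return zombies


if __name__ == "__main__":
    for pid, comm in find_zombies(SAMPLE_PS):
        print(pid, comm)
```

A harness could feed it the text of `docker exec ... ps -eo pid,stat,comm` and fail when the returned list is non-empty.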

will do in next stage

Canvinus · 2026-06-11T21:28:33Z

+docker run -d --name "$NAME" \
+	--device /dev/net/tun \
+	--security-opt seccomp="$SECCOMP" \
+	--security-opt apparmor=unconfined \


🟡 The deferred AppArmor TODO on this line is solvable now — I built and tested the profile.

docker-default's only rule that blocks this workload is deny mount,. A tailored profile = docker-default verbatim with that one line swapped for mount, + pivot_root,, keeping every other protection. Loaded on the Hetzner box in enforce mode:

AppArmor on pid1: podman-builder (enforce)
info: ok   build: ok   run+fwd: ok
/proc/sysrq-trigger write: DENIED ✓
/proc/kcore read: DENIED ✓

Drop-in container/apparmor-podman-builder (load with apparmor_parser -r -W, run with --security-opt apparmor=podman-builder):

#include <tunables/global>

profile podman-builder flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  network,
  capability,
  file,
  umount,

  signal (receive) peer=unconfined,
  signal (receive) peer=runc,
  signal (receive) peer=crun,
  signal (send,receive) peer=podman-builder,

  deny @{PROC}/* w,
  deny @{PROC}/sys/[^k]** w,
  deny @{PROC}/sys/kernel/{?,??,[^s][^h][^m]**} w,
  deny @{PROC}/sysrq-trigger rwklx,
  deny @{PROC}/kcore rwklx,

  mount,        # docker-default has `deny mount,` here; this is the ONLY relaxation
  pivot_root,

  deny /sys/[^f]*/** wklx,
  deny /sys/firmware/** rwklx,
  deny /sys/kernel/security/** rwklx,

  ptrace (trace,read,tracedby,readby) peer=podman-builder,
}

Strictly tighter than unconfined, and it protects the credential-bearing outer layer — worth doing before Stage 2 ships agent-server into this container. (Caveat already noted in the doc: this is Ubuntu-host-specific; the profile name must be loaded on the host, so it belongs with appx's Stage-3 host provisioning.)

Canvinus · 2026-06-11T21:28:33Z

+# Fedora/quay.io/podman-stable instead ship them with file capabilities, so
+# euid stays 1000 (== owner of the nested userns) and the ownership shortcut
+# applies. Replicate that here:
+RUN chmod u-s /usr/bin/newuidmap /usr/bin/newgidmap \


✅ This is the linchpin and it's correct — matches how quay.io/podman/stable ships these helpers (rpm --setcaps shadow-utils). Confirmed on a real 24.04 host: getcap shows cap_setuid=ep / cap_setgid=ep, and the uid-map step works with apparmor_restrict_unprivileged_userns=1 left at its hardened default.

⚠️ Worth a one-line warning comment here: this approach is incompatible with --security-opt no-new-privileges. Under no_new_privs, execve cannot add file capabilities to the permitted set (kernel no_new_privs doc), so newuidmap would silently lose CAP_SETUID and rootless podman would fail at namespace setup. Someone will eventually add that flag as "hardening" — a comment here will save them a baffling outage.
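
Beyond a comment, a launcher script could refuse to start when someone adds that flag. The guard below is a hedged sketch, not something in this PR: `has_no_new_privileges` is a hypothetical helper, and it assumes the option is spelled `no-new-privileges`, optionally followed by `:true`/`=true` or `:false`/`=false`.

```python
def has_no_new_privileges(argv):
    """True if a docker-run style argv enables no-new-privileges."""
    opts = []
    for i, arg in enumerate(argv):
        if arg == "--security-opt" and i + 1 < len(argv):
            opts.append(argv[i + 1])          # "--security-opt VALUE" form
        elif arg.startswith("--security-opt="):
            opts.append(arg.split("=", 1)[1])  # "--security-opt=VALUE" form
    for opt in opts:
        if opt.startswith("no-new-privileges"):
            value = opt[len("no-new-privileges"):].lstrip(":=")
            if value.lower() != "false":
                return True
    return False


if __name__ == "__main__":
    argv = ["docker", "run", "-d",
            "--security-opt", "apparmor=unconfined",
            "--security-opt", "no-new-privileges"]
    if has_no_new_privileges(argv):
        print("refusing: file-cap newuidmap cannot gain CAP_SETUID under no_new_privs")
```

Failing early with an explicit message is cheaper to debug than a namespace-setup error surfacing inside the nested podman.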

added comment

Canvinus · 2026-06-11T21:28:33Z

+ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
+# Spike keeps the container alive for `docker exec` iteration; Stage 2 replaces
+# this with the agent-server process.
+CMD ["sleep", "infinity"]


Note for Stage 2: when this CMD is replaced by the agent-server process, Node-as-PID-1 has the same non-reaping / no-signal-handling problem as sleep infinity (see the zombies I found, commented on run-outer.sh L20). Plan to run under --init, or use catatonit/tini as the entrypoint wrapper.

Optional, from the upstream podman image: ENV BUILDAH_ISOLATION=chroot makes nested podman build use chroot isolation (which is effectively what's happening in this unprivileged-nested setup anyway) and avoids a needless inner-userns attempt.

for next stage

Canvinus · 2026-06-11T21:28:33Z

+Remarkably, **no host-level sysctl/apparmor change was required** — the hardened Ubuntu
+defaults (`apparmor_restrict_unprivileged_userns=1`) are left untouched.
+
+## Result summary


⚪ Nit: duplicate ## Result summary heading (the real one is at L22) with a leftover template stub right below it:

## Result summary 

This second heading + the HTML comment can be deleted.

fixed in 572acf0

alexanderkreidich

Reviewing this from the perspective of #3 (the production builder-container implementation in docker/builder/, currently open), since both PRs solve the same problem — rootless podman inside an unprivileged Docker container — independently.

Overall

The spike delivers real value and I'd merge it: SPIKE-FINDINGS.md with the root-cause trace of the newuidmap setuid packaging issue, the deletion-tested flag set, and the seccomp profile with provenance (gen-seccomp.sh) are exactly the knowledge that should live in the repo rather than in a PR description. No file conflicts with #3 (different directories, zero deletions).

Findings here that #3 should adopt (proposed follow-up)

This spike proves that two things #3 currently ships can be tightened — #3 even lists seccomp hardening as a named follow-up, and this PR is that follow-up's content:

1. newuidmap via file capabilities instead of setuid → drops --cap-add SYS_ADMIN entirely. #3 claims SYS_ADMIN is required for the uid/gid maps; this spike shows it isn't once the file-cap fix (3 Dockerfile lines) is in place.
2. Tailored seccomp profile (seccomp-builder.json = stock podman profile + 3 ungated syscalls) instead of seccomp=unconfined.
3. Native overlay storage instead of fuse-overlayfs: ~2× faster per the benchmarks here, and it removes the need for --device /dev/fuse.

Suggested sequencing: merge both PRs (order doesn't matter), then a follow-up on top of #3 that ports items 1–3 into docker/builder/run.sh + Dockerfile, re-runs docker/builder/verify.sh with the new flags, and deletes the throwaway container/ rig — keeping SPIKE-FINDINGS.md, moved under docs/. Otherwise the repo ends up with two Dockerfiles and two run scripts for the same thing, and it won't be obvious which one is canonical.

One architectural discrepancy to settle before the follow-up

Port publishing differs between the two PRs, and it's a real design decision, not a detail:

This PR: -p 127.0.0.1:10000-10009 — loopback-only, assumes appx proxies app traffic in.
#3: -p 4001 + -p 3000-3010 — published to the host directly (and verified end-to-end that way, including from outside the VM).

Who terminates app traffic (appx proxy vs. direct host publishing) should be decided once and recorded in docs/architecture/important/builder-container-architecture.md, rather than left as a difference between two run scripts.

alexanderkreidich · 2026-06-11T22:36:45Z

Proposal: lifecycle for container/ after merge (follow-up to my review above)

To avoid ending up with two Dockerfiles and two run scripts for the same thing in the repo (docker/builder/ from #3 and container/ from this PR), I propose we agree on this sequence:

1. Merge both PRs as-is: there are zero git conflicts (verified with a test merge; the branches touch disjoint paths). This PR lands as the Stage 0 record; #3 (Outer builder container: agent-server + rootless podman image) lands as the production implementation.
2. One hardening follow-up PR on top of docker/builder/ that ports the three proven findings from this spike into the canonical rig:
   - newuidmap via file capabilities → drop --cap-add SYS_ADMIN
   - seccomp-builder.json + gen-seccomp.sh → replace seccomp=unconfined
   - native overlay storage → replace fuse-overlayfs and drop --device /dev/fuse

   Acceptance: docker/builder/verify.sh passes with the new flags.
3. In that same follow-up, delete container/. By then the rig has done its job (hypothesis proven, all useful pieces live in the production code), and keeping it around invites someone to extend the wrong Dockerfile. SPIKE-FINDINGS.md is kept, moved under docs/, since it documents why the production flags are what they are.

Until the follow-up lands, it would help to add a one-line banner at the top of SPIKE-FINDINGS.md (or a container/README) saying the canonical contract is docker/builder/ and container/ is a Stage 0 artifact scheduled for removal — so nothing new gets built on top of it in the meantime.

The only decision this plan deliberately does not make is the port-publishing question (loopback + appx proxy here vs. direct host publishing in #3) — that one needs an explicit call recorded in docs/architecture/important/builder-container-architecture.md.

neuromaxer · 2026-06-12T08:49:11Z

re this

Who terminates app traffic (appx proxy vs. direct host publishing) should be decided once and recorded in docs/architecture/important/builder-container-architecture.md, rather than left as a difference between two run scripts.

@alexanderkreidich In my mind, appx terminates the app traffic and proxies it in.

UPD: documented that in 572acf0

- gen-seccomp.sh: strip sethostname/setdomainname/setns from the stock profile's complementary SCMP_ACT_ERRNO rule so they aren't both ALLOWed and denied; regenerate seccomp-builder.json (smoke.sh 11/11)
- Dockerfile: warn that the newuidmap file-cap fix is incompatible with --security-opt no-new-privileges
- SPIKE-FINDINGS.md: drop duplicate 'Result summary' heading + template stub
- builder-container-architecture.md: record decision that appx terminates and proxies app traffic (outer container is loopback-only)
- move docs/superpowers/plans/* -> docs/plans/, delete superpowers dir

neuromaxer and others added 12 commits June 11, 2026 17:44

add containerisation plan

a4b57b0

add podman spike files

5fdae2a

spike: record host facts (Ubuntu 26.04, kernel 7.0, apparmor userns restriction active)

2c1bcf7

update spike brief

a0ea365

Merge branch 'containerise-p1' into stage0-spike

209bf02

spike: clear crun runtime state on boot so 'podman start --all' recovers inner containers after docker restart (Stage 4 recovery mechanism)

9e0442c

spike: T4 pin native rootless overlay, drop --device /dev/fuse

7a19d72

Native overlay works on kernel 7.0 and is ~2.2x faster than fuse-overlayfs (582ms vs 1281ms on a 300-file build), and removes the need for /dev/fuse. fuse-overlayfs left installed as documented fallback only.

spike: update run-outer.sh header (candidate -> final proven set, T2 complete)

5011ed0

Canvinus reviewed Jun 11, 2026

alexanderkreidich reviewed Jun 11, 2026

neuromaxer added 4 commits June 12, 2026 09:01

update docs and add another spike task for complex inner app container

df819ba

add results of inner app spike

00f1aff

update containerisation plan after inner app container spike

dccebfd

neuromaxer changed the title ~~Containerise~~ Containerise - run spikes on outer and inner containers Jun 12, 2026

neuromaxer merged commit fd31bb7 into main Jun 12, 2026
1 check passed
