Containerise - run spikes on outer and inner containers #4
…estriction active)
…check) Root cause traced with bpftrace: Ubuntu's setuid-root newuidmap runs euid=0 inside the unprivileged outer container, which bypasses the kernel uid_map ownership shortcut and demands CAP_SYS_ADMIN in the init userns (denied by docker's cap bounding set). Switching to file caps (cap_setuid+ep) keeps euid 1000 == nested-userns owner, matching Fedora/podman-stable. Also pre-create volume mountpoints owned by builder so named volumes don't mount root-owned.
- add --device /dev/net/tun (slirp4netns rootless networking needs it)
- clear default_sysctls in containers.conf (crun couldn't write read-only /proc/sys/net/ipv4/ping_group_range under the nested netns)
- add --security-opt systempaths=unconfined (docker masks /proc submounts; kernel mount_too_revealing() then blocks the inner container's fresh proc mount with 'mount proc to proc: Operation not permitted'). No privilege/caps.
- entrypoint wipes stale XDG_RUNTIME_DIR pause-process state on each boot so podman survives docker restart
…ers inner containers after docker restart (Stage 4 recovery mechanism)
Native overlay works on kernel 7.0 and is ~2.2x faster than fuse-overlayfs (582ms vs 1281ms on a 300-file build), and removes the need for /dev/fuse. fuse-overlayfs left installed as documented fallback only.
Deletion-tested all four security flags; each is required with a distinct failure (recorded in findings). Replaced seccomp=unconfined with a tailored profile (container/seccomp-builder.json, gen-seccomp.sh documents provenance): podman's stock profile + ungating only sethostname/setdomainname/setns, while keeping bpf/perf_event_open/quotactl/fanotify_init/lookup_dcookie denied. Strictly tighter than unconfined. Smoke still 11/11.
- Tested podman as outer runtime (installed podman 5.7 on host, recorded): rootless podman-outer fails (rootless-in-rootless subuid exhaustion at newuidmap); rootful podman-outer works with a SMALLER flag set (no seccomp override - podman default allows mount; no systempaths) but needs root + DNS config. docker-outer remains the proven path.
- Filled in all findings sections: flag table with exact deletion errors, host-prereqs (none beyond docker), host-changes log (all sysctls restored to hardened defaults), T4 storage, T3 warmup+restart, port chain, Stage 2 recommendations.
- Final clean-slate smoke (after docker system prune -af): 11/11 PASS, exit 0.
Canvinus
left a comment
Review — Stage 0 nested-podman spike
I reviewed this by independently reproducing the spike on a clean host, not just reading the diff. I stood up a throwaway Hetzner Ubuntu 24.04 box (kernel 6.8, AppArmor on, apparmor_restrict_unprivileged_userns=1) — the exact production target the findings doc says it couldn't use — plus an arm64 Ubuntu 24.04 VM for cross-arch coverage. Installed Docker 29.5.3 on both and ran the full chain.
Verdict
The spike is sound and its conclusion holds. The unprivileged docker → rootless-podman → inner-app chain works on the real production OS, on both x86_64 and arm64, with zero host changes and the hardened userns sysctl left at 1. Nothing ships from this PR (everything is under container/ + docs/), so merge risk is ~zero. Recommend merge after the one should-fix below (a one-line gen-seccomp.sh correctness bug); everything else is forward-looking and can ride into Stage 2.
What I verified (independent reproduction)
| Check | x86_64 / 24.04 (Hetzner) | arm64 / 24.04 (OrbStack) |
|---|---|---|
| smoke.sh full chain | 11/11 | 11/11 |
| apparmor=unconfined actually required | ✓ deletion test reproduces the exact overlay-mount denial | n/a (OrbStack kernel has no AppArmor LSM) |
| Hardened userns sysctl stays =1 through green runs | ✓ | n/a |
| uid isolation: inner "root" → host uid 1000, worker → 100100 | ✓ | ✓ |
This closes the doc's own open item ("re-verify on a genuine Ubuntu 24.04 host before production") — done, it passes.
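The uid-isolation row can be read as a plain `uid_map` translation. The sketch below is illustrative only: the map entries and the inner worker uid of 101 are assumptions chosen to reproduce the two numbers in the table (the real values live in the container's `/etc/subuid` and `/proc/<pid>/uid_map`), and it assumes the outer Docker container does no further uid remapping.

```python
from typing import List, Optional, Tuple

# Each entry mirrors one line of /proc/<pid>/uid_map:
# (first uid inside the namespace, first uid outside it, length of the range).
UidMap = List[Tuple[int, int, int]]

# ASSUMPTION: a typical rootless-podman map for a user with uid 1000 and a
# subuid range starting at 100000. Not taken from the spike itself.
ASSUMED_PODMAN_MAP: UidMap = [
    (0, 1000, 1),        # inner root maps to the unprivileged owner
    (1, 100000, 65536),  # every other inner uid maps into the subuid range
]


def to_outer_uid(inner_uid: int, uid_map: UidMap) -> Optional[int]:
    """Translate a uid inside the user namespace to the uid seen outside it."""
    for inside, outside, length in uid_map:
        if inside <= inner_uid < inside + length:
            return outside + (inner_uid - inside)
    return None  # unmapped uids have no identity outside the namespace


root_on_host = to_outer_uid(0, ASSUMED_PODMAN_MAP)
# ASSUMPTION: the inner "worker" user has uid 101.
worker_on_host = to_outer_uid(101, ASSUMED_PODMAN_MAP)
```

Under these assumptions neither inner identity maps to host uid 0, which is the property the table row is checking.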
Strengths
- The headline `newuidmap` root-cause (`Dockerfile` L31–33) is correct and matches upstream `quay.io/podman/stable` (rpm --setcaps shadow-utils). It was bpftrace-traced, not guessed.
- Every `docker run` flag is deletion-tested with the exact failure recorded; I reproduced all four independently.
- Storage-on-a-volume is load-bearing and correctly chosen: without it, podman's overlay store lands on the container's own overlay rootfs → overlay-on-overlay, which the kernel forbids (torvalds/linux@76bc8e2). I confirmed the broken fuse-overlayfs fallback empirically.
Findings by severity
🔴 Should-fix (1) — gen-seccomp.sh leaves a contradictory ALLOW+ERRNO rule pair for the same three syscalls (inline comment, with one-line fix).
🟡 Recommend for Stage 2 (forward-looking, all tested)
- Add `--init`: I found live zombies (`slirp4netns`, `conmon`) in the running container (inline comment has the evidence).
- Add resource limits + restart policy: the container currently runs uncapped with `RestartPolicy=no` (inline comment).
- Replace `apparmor=unconfined` with a tailored profile: I built and validated one (inline comment has the full profile; passes the chain in enforce mode while still denying `/proc/sysrq-trigger` & `/proc/kcore`). Closes the deferred TODO at `run-outer.sh` L23.
- Optional: drop capabilities to a minimal set. Verified the chain works with `--cap-drop=ALL` + 11 caps vs. the full default 14.
⚪ Nits
- Duplicate "Result summary" heading + leftover template stub in `SPIKE-FINDINGS.md` (inline).
- Docs land in `docs/plans/` while `main` now uses `docs/superpowers/plans/` (post-refactor convention).
Combined-hardening proof
I ran all the recommendations together in one container — fixed seccomp + tailored AppArmor (enforcing) + --cap-drop=ALL+minimal caps + --init + --memory=3g --cpus=2 --pids-limit=2048 + --restart=unless-stopped — and re-ran the full chain incl. restart recovery:
COMBINED HARDENING: 15 passed, 0 failed
pid1 is docker-init (reaper) · apparmor podman-builder (enforce) · CapBnd=880404fb
podman info/run/build/port-chain · zombie count = 0 · sysrq-trigger+kcore DENIED
workspace+image-store survive restart · podman start recovers inner app
So none of the hardening conflicts with the working setup — it can be adopted wholesale in Stage 2.
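For readers who want to see what `CapBnd=880404fb` means, here is a small decoding sketch. The bit numbering follows `<linux/capability.h>`; the comparison mask `a80425fb` is my assumption of the usual Docker default bounding set and is not taken from the run above.

```python
from typing import List

# Capability names indexed by bit number, as defined in <linux/capability.h>.
CAP_NAMES: List[str] = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask_hex: str) -> List[str]:
    """Expand a CapBnd/CapEff hex mask from /proc/<pid>/status into names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


hardened = decode_caps("880404fb")        # mask reported by the combined run
docker_default = decode_caps("a80425fb")  # ASSUMPTION: usual Docker default set
dropped = sorted(set(docker_default) - set(hardened))
```

The reported mask decodes to 11 capabilities, consistent with the "11 caps vs. the full default 14" figure, and `CAP_SYS_ADMIN` (bit 21) is not among them.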
Best-practice cross-check (sources)
Confirmed the recipe against: Dan Walsh — "Podman inside a container", the upstream containers/image_build podman image, the moby docker-default AppArmor template, the kernel no_new_privs & overlayfs docs, capabilities(7), and podman v5.0 release notes. All consistent with the spike. One forward note: podman 5.x switches the rootless default slirp4netns→pasta (both still need /dev/net/tun), so a base-image bump warrants a re-smoke.
for s in d['syscalls']:
    inc=s.get('includes',{})
    if s['action']=='SCMP_ACT_ALLOW' and inc.get('caps')==['CAP_SYS_ADMIN']:
        s['names']=[n for n in s['names'] if n in NEED]
🔴 Should-fix — a contradictory seccomp rule survives generation.
This loop rewrites only the SCMP_ACT_ALLOW rule, but podman's stock profile also ships a complementary SCMP_ACT_ERRNO rule (excludes.caps=[CAP_SYS_ADMIN]) that also lists sethostname/setdomainname/setns. After this script runs, seccomp-builder.json contains both an ALLOW rule and an ERRNO rule naming those three syscalls. I confirmed this in the generated file — they appear in two separate rules.
Which rule wins on conflicting input is libseccomp/runtime-version-defined, not guaranteed. It happens to resolve ALLOW-wins on Docker 29.5.3 (I verified the chain works on both x86_64 and arm64), but a runtime/libseccomp bump could flip it and break inner-container setup with a confusing Operation not permitted. Since this script's whole job is provenance/correctness, worth fixing.
One-line fix — also strip the names from the deny rule:
for s in d['syscalls']:
    inc=s.get('includes',{})
    if s['action']=='SCMP_ACT_ALLOW' and inc.get('caps')==['CAP_SYS_ADMIN']:
        s['names']=[n for n in s['names'] if n in NEED]
        s.pop('includes',None)
    # also remove them from the complementary ERRNO rule so they aren't both allowed and denied
    if s['action']=='SCMP_ACT_ERRNO' and s.get('excludes',{}).get('caps')==['CAP_SYS_ADMIN']:
        s['names']=[n for n in s['names'] if n not in NEED]

Verified on the box: with this applied, the full nested chain still passes (inner podman run exercises exactly these three syscalls during namespace setup).
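A regression check for this class of bug could look roughly like the sketch below. It is not part of the PR: `conflicting_syscalls` is a hypothetical helper, the sample profile is made up, and the check deliberately ignores `args` filters, so treat it as a coarse lint instead of a model of how libseccomp resolves rules.

```python
from typing import Any, Dict, Set


def conflicting_syscalls(profile: Dict[str, Any]) -> Set[str]:
    """Return syscalls that are unconditionally allowed yet also named in a deny rule.

    Conditional ALLOW rules (those carrying includes/excludes) are skipped,
    because the stock profile pairs them with a complementary ERRNO rule on
    purpose. The contradiction only appears once the ALLOW rule loses its
    condition while the ERRNO rule keeps the same names.
    """
    allowed: Set[str] = set()
    denied: Set[str] = set()
    for rule in profile.get("syscalls", []):
        names = set(rule.get("names", []))
        action = rule.get("action")
        if action == "SCMP_ACT_ALLOW" and not rule.get("includes") and not rule.get("excludes"):
            allowed |= names
        elif action == "SCMP_ACT_ERRNO":
            denied |= names
    return allowed & denied


# Made-up miniature profile reproducing the shape of the problem.
sample_profile = {
    "syscalls": [
        {"names": ["setns", "sethostname"], "action": "SCMP_ACT_ALLOW"},
        {"names": ["setns", "bpf"], "action": "SCMP_ACT_ERRNO",
         "excludes": {"caps": ["CAP_SYS_ADMIN"]}},
        {"names": ["mount"], "action": "SCMP_ACT_ALLOW",
         "includes": {"caps": ["CAP_SYS_ADMIN"]}},
        {"names": ["mount"], "action": "SCMP_ACT_ERRNO",
         "excludes": {"caps": ["CAP_SYS_ADMIN"]}},
    ]
}
conflicts = conflicting_syscalls(sample_profile)
```

Running something like this over the generated `seccomp-builder.json` (loaded with `json.load`) and failing when the result is non-empty would catch the duplicate before it reaches a runtime.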
docker build -t "$IMAGE" .
docker rm -f "$NAME" 2>/dev/null || true

docker run -d --name "$NAME" \
🟡 Recommend (Stage 2): add --init, resource limits, and a restart policy.
Zombies (real, found it running). With CMD ["sleep","infinity"] (Dockerfile L77) as PID 1, the live container accumulates zombies:
$ docker exec builder-outer ps -eo pid,stat,comm | awk '$2 ~ /Z/'
65 Z slirp4netns
68 Zs conmon
PID 1 is sleep, which reaps nothing and catches no signals (/proc/1/status → SigCgt=0, so docker stop always hits the 10 s SIGKILL timeout). podman detaches a conmon per inner container that reparents to PID 1. docker run --init (tini) fixes both — PID 1 becomes docker-init and zombie count drops to 0 (tested). This matters more at Stage 2: Node-as-PID-1 (agent-server) has the same non-reaping problem.
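If the zombie check ends up in a smoke test, the `ps` filter could be sketched in Python as follows. `find_zombies` is a hypothetical helper, and the sample text simply mirrors the `pid,stat,comm` output shown above plus an assumed `sleep` row for PID 1.

```python
from typing import List, Tuple


def find_zombies(ps_output: str) -> List[Tuple[int, str]]:
    """Extract (pid, command) pairs whose process state starts with 'Z'.

    Expects the output of `ps -eo pid,stat,comm`; header or malformed
    lines are skipped.
    """
    zombies: List[Tuple[int, str]] = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        pid, stat, comm = fields
        if stat.startswith("Z"):
            zombies.append((int(pid), comm.strip()))
    return zombies


sample_ps = """  PID STAT COMMAND
    1 Ss   sleep
   65 Z    slirp4netns
   68 Zs   conmon
"""
zombies = find_zombies(sample_ps)
```

A smoke step could then assert that the list is empty once the container runs under `--init`.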
Limits + restart. docker inspect shows Memory=0, NanoCpus=0, PidsLimit=null, RestartPolicy=no — a runaway inner build/fork starves the host and every other project, and the outer container won't return after a host reboot. Suggest:
--init --memory=3g --cpus=2 --pids-limit=2048 --restart=unless-stopped
The plan defers these to Stage 4, but they're free to add now and I verified they don't disturb the chain (combined run = 15/15).
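To keep these settings from silently regressing, a check over the `HostConfig` section of `docker inspect` could be sketched like this. `missing_limits` is a hypothetical helper; the two dictionaries are hand-written to match the values quoted above and are not captured output.

```python
from typing import Any, Dict, List


def missing_limits(host_config: Dict[str, Any]) -> List[str]:
    """List the resource/restart settings that are still at Docker's unlimited defaults."""
    findings: List[str] = []
    if not host_config.get("Memory"):
        findings.append("no memory limit (--memory)")
    if not host_config.get("NanoCpus"):
        findings.append("no CPU limit (--cpus)")
    pids = host_config.get("PidsLimit")
    if pids is None or pids <= 0:
        findings.append("no pids limit (--pids-limit)")
    policy = (host_config.get("RestartPolicy") or {}).get("Name", "")
    if policy in ("", "no"):
        findings.append("no restart policy (--restart)")
    return findings


# Values as reported for the spike container.
spike = {"Memory": 0, "NanoCpus": 0, "PidsLimit": None,
         "RestartPolicy": {"Name": "no"}}
# Values corresponding to the suggested flags.
hardened = {"Memory": 3 * 1024**3, "NanoCpus": 2_000_000_000, "PidsLimit": 2048,
            "RestartPolicy": {"Name": "unless-stopped"}}

spike_findings = missing_limits(spike)
hardened_findings = missing_limits(hardened)
```

In practice the dictionary would come from parsing `docker inspect --format '{{json .HostConfig}}' <container>` with `json.loads`.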
will do in next stage
docker run -d --name "$NAME" \
  --device /dev/net/tun \
  --security-opt seccomp="$SECCOMP" \
  --security-opt apparmor=unconfined \
🟡 The deferred AppArmor TODO on this line is solvable now — I built and tested the profile.
docker-default's only rule that blocks this workload is `deny mount,`. A tailored profile = docker-default verbatim with that one line swapped for `mount,` + `pivot_root,`, keeping every other protection. Loaded on the Hetzner box in enforce mode:
AppArmor on pid1: podman-builder (enforce)
info: ok build: ok run+fwd: ok
/proc/sysrq-trigger write: DENIED ✓
/proc/kcore read: DENIED ✓
Drop-in container/apparmor-podman-builder (load with apparmor_parser -r -W, run with --security-opt apparmor=podman-builder):
#include <tunables/global>
profile podman-builder flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
network, capability, file, umount,
signal (receive) peer=unconfined,
signal (receive) peer=runc,
signal (receive) peer=crun,
signal (send,receive) peer=podman-builder,
deny @{PROC}/* w,
deny @{PROC}/sys/[^k]** w,
deny @{PROC}/sys/kernel/{?,??,[^s][^h][^m]**} w,
deny @{PROC}/sysrq-trigger rwklx,
deny @{PROC}/kcore rwklx,
mount, # docker-default has `deny mount,` here — the ONLY relaxation
pivot_root,
deny /sys/[^f]*/** wklx,
deny /sys/firmware/** rwklx,
deny /sys/kernel/security/** rwklx,
ptrace (trace,read,tracedby,readby) peer=podman-builder,
}
Strictly tighter than unconfined, and it protects the credential-bearing outer layer — worth doing before Stage 2 ships agent-server into this container. (Caveat already noted in the doc: this is Ubuntu-host-specific; the profile name must be loaded on the host, so it belongs with appx's Stage-3 host provisioning.)
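A startup guard for the profile could be sketched as below. It assumes the usual AppArmor label format exposed in `/proc/<pid>/attr/current` (`name (mode)`, or a bare `unconfined`); both helper names are hypothetical.

```python
from typing import Optional, Tuple


def parse_apparmor_label(raw: str) -> Tuple[str, Optional[str]]:
    """Split an AppArmor label such as 'podman-builder (enforce)' into (profile, mode)."""
    label = raw.replace("\x00", "").strip()
    if label.endswith(")") and " (" in label:
        profile, mode = label[:-1].rsplit(" (", 1)
        return profile, mode
    return label, None  # e.g. 'unconfined' carries no mode suffix


def is_enforced(raw: str, expected_profile: str) -> bool:
    """True only when the expected profile is attached in enforce mode."""
    profile, mode = parse_apparmor_label(raw)
    return profile == expected_profile and mode == "enforce"


tailored = parse_apparmor_label("podman-builder (enforce)\n")
spike_default = parse_apparmor_label("unconfined\n")
```

An entrypoint could read `/proc/1/attr/current` and refuse to start (or just log loudly) when `is_enforced(..., "podman-builder")` is false, so a host that never loaded the profile does not quietly fall back to something weaker.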
# Fedora/quay.io/podman-stable instead ship them with file capabilities, so
# euid stays 1000 (== owner of the nested userns) and the ownership shortcut
# applies. Replicate that here:
RUN chmod u-s /usr/bin/newuidmap /usr/bin/newgidmap \
✅ This is the linchpin and it's correct — matches how quay.io/podman/stable ships these helpers (rpm --setcaps shadow-utils). Confirmed on a real 24.04 host: getcap shows cap_setuid=ep / cap_setgid=ep, and the uid-map step works with apparmor_restrict_unprivileged_userns=1 left at its hardened default.
One suggestion: add a comment here warning that this fix is incompatible with --security-opt no-new-privileges. Under no_new_privs, execve cannot add file capabilities to the permitted set (kernel no_new_privs doc), so newuidmap would silently lose CAP_SETUID and rootless podman would fail at namespace setup. Someone will eventually add that flag as "hardening"; a comment here will save them a baffling outage.
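Beyond a comment, the entrypoint could detect the flag at boot. The sketch below parses the `NoNewPrivs` field of `/proc/<pid>/status`; `no_new_privs_set` is a hypothetical helper and the sample strings are hand-written.

```python
from typing import Optional


def no_new_privs_set(status_text: str) -> Optional[bool]:
    """Report whether no_new_privs is active, given the text of /proc/<pid>/status.

    Returns None when the field is absent (kernels that do not expose it).
    """
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "NoNewPrivs":
            return value.strip() == "1"
    return None


with_flag = no_new_privs_set("Name:\tsleep\nNoNewPrivs:\t1\nSeccomp:\t2\n")
without_flag = no_new_privs_set("Name:\tsleep\nNoNewPrivs:\t0\nSeccomp:\t2\n")
```

Feeding it the contents of `/proc/self/status` and failing fast with a message that names `newuidmap` would turn the outage into an obvious startup error.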
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
# Spike keeps the container alive for `docker exec` iteration; Stage 2 replaces
# this with the agent-server process.
CMD ["sleep", "infinity"]
Note for Stage 2: when this CMD is replaced by the agent-server process, Node-as-PID-1 has the same non-reaping / no-signal-handling problem as sleep infinity (see the zombies I found, commented on run-outer.sh L20). Plan to run under --init, or use catatonit/tini as the entrypoint wrapper.
Optional, from the upstream podman image: ENV BUILDAH_ISOLATION=chroot makes nested podman build use chroot isolation (which is effectively what's happening in this unprivileged-nested setup anyway) and avoids a needless inner-userns attempt.
Remarkably, **no host-level sysctl/apparmor change was required** — the hardened Ubuntu
defaults (`apparmor_restrict_unprivileged_userns=1`) are left untouched.

## Result summary
⚪ Nit: duplicate ## Result summary heading (the real one is at L22) with a leftover template stub right below it:
## Result summary
<!-- One paragraph: does the unprivileged nested chain work on this host? -->
This second heading + the HTML comment can be deleted.
alexanderkreidich
left a comment
Reviewing this from the perspective of #3 (the production builder-container implementation in docker/builder/, currently open), since both PRs solve the same problem — rootless podman inside an unprivileged Docker container — independently.
Overall
The spike delivers real value and I'd merge it: SPIKE-FINDINGS.md with the root-cause trace of the newuidmap setuid packaging issue, the deletion-tested flag set, and the seccomp profile with provenance (gen-seccomp.sh) are exactly the knowledge that should live in the repo rather than in a PR description. No file conflicts with #3 (different directories, zero deletions).
Findings here that #3 should adopt (proposed follow-up)
This spike proves that two things #3 currently ships can be tightened — #3 even lists seccomp hardening as a named follow-up, and this PR is that follow-up's content:
1. `newuidmap` via file capabilities instead of setuid → drops `--cap-add SYS_ADMIN` entirely. #3 claims SYS_ADMIN is required for the uid/gid maps; this spike shows it isn't once the file-cap fix (3 Dockerfile lines) is in place.
2. Tailored seccomp profile (`seccomp-builder.json` = stock podman profile + 3 ungated syscalls) instead of `seccomp=unconfined`.
3. Native overlay storage instead of fuse-overlayfs: ~2× faster per the benchmarks here, and removes the need for `--device /dev/fuse`.
Suggested sequencing: merge both PRs (order doesn't matter), then a follow-up on top of #3 that ports items 1–3 into docker/builder/run.sh + Dockerfile, re-runs docker/builder/verify.sh with the new flags, and deletes the throwaway container/ rig — keeping SPIKE-FINDINGS.md, moved under docs/. Otherwise the repo ends up with two Dockerfiles and two run scripts for the same thing, and it won't be obvious which one is canonical.
One architectural discrepancy to settle before the follow-up
Port publishing differs between the two PRs, and it's a real design decision, not a detail:
- This PR: `-p 127.0.0.1:10000-10009`, loopback-only, assumes appx proxies app traffic in.
- #3: `-p 4001` + `-p 3000-3010`, published to the host directly (and verified end-to-end that way, including from outside the VM).
Who terminates app traffic (appx proxy vs. direct host publishing) should be decided once and recorded in docs/architecture/important/builder-container-architecture.md, rather than left as a difference between two run scripts.
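Whichever way the decision goes, it can be enforced mechanically. A rough sketch, assuming the `HostConfig.PortBindings` structure that `docker inspect` returns (`non_loopback_bindings` is a hypothetical helper and the two sample dictionaries are hand-written):

```python
import ipaddress
from typing import Any, Dict, List, Optional


def non_loopback_bindings(port_bindings: Optional[Dict[str, Any]]) -> List[str]:
    """Return descriptions of published ports that are reachable beyond loopback.

    An empty HostIp means Docker binds on all interfaces, so it is flagged too.
    """
    exposed: List[str] = []
    for container_port, bindings in (port_bindings or {}).items():
        for binding in bindings or []:
            host_ip = binding.get("HostIp", "")
            if not host_ip or not ipaddress.ip_address(host_ip).is_loopback:
                shown_ip = host_ip or "0.0.0.0"
                exposed.append(f"{container_port} -> {shown_ip}:{binding.get('HostPort', '')}")
    return exposed


# Shape of a loopback-only publish, as in this PR.
loopback_only = {"10000/tcp": [{"HostIp": "127.0.0.1", "HostPort": "10000"}]}
# Shape of a direct host publish, as in #3.
direct = {"4001/tcp": [{"HostIp": "", "HostPort": "4001"}]}

loopback_findings = non_loopback_bindings(loopback_only)
direct_findings = non_loopback_bindings(direct)
```

If appx is chosen as the terminating proxy, a verify script could fail on any non-empty result; if direct publishing wins, the same function documents exactly which ports are exposed.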
Proposal: lifecycle for
To avoid ending up with two Dockerfiles and two run scripts for the same thing in the repo (
Until the follow-up lands, it would help to add a one-line banner at the top of
The only decision this plan deliberately does not make is the port-publishing question (loopback + appx proxy here vs. direct host publishing in #3); that one needs an explicit call recorded in
re this
In my mind, appx terminates the traffic and proxies it in. @alexanderkreidich UPD: documented that in 572acf0
- gen-seccomp.sh: strip sethostname/setdomainname/setns from the stock profile's complementary SCMP_ACT_ERRNO rule so they aren't both ALLOWed and denied; regenerate seccomp-builder.json (smoke.sh 11/11)
- Dockerfile: warn that the newuidmap file-cap fix is incompatible with --security-opt no-new-privileges
- SPIKE-FINDINGS.md: drop duplicate 'Result summary' heading + template stub
- builder-container-architecture.md: record decision that appx terminates and proxies app traffic (outer container is loopback-only)
- move docs/superpowers/plans/* -> docs/plans/, delete superpowers dir
Stage 0 spike: proving rootless Podman works inside an unprivileged Docker container
What this is
A spike (investigation, not production code) validating the core assumption behind our containerised-apps architecture: that we can run a Docker "outer" container that itself runs Podman to build and serve user apps — without giving it root or privileged access to the host.
Result: it works, securely and fast. This PR captures the proven recipe + a detailed writeup so we can build Stage 2 (agent-server in the container) with confidence.
Nothing here ships yet; it's all under `container/` (throwaway test rig) and `docs/`.
TL;DR of what we found
No `--privileged`, no dangerous capabilities, runs as a non-root user.
The real blocker turned out not to be AppArmor (what we expected) but a subtle `setuid` packaging issue with Ubuntu's `newuidmap` helper, fixed in 3 Dockerfile lines (the same trick the official Podman image uses). Full root-cause trace is in the findings doc.
What to review (in priority order)
1. `container/SPIKE-FINDINGS.md`: start here. The complete writeup: the headline fix, every Docker flag justified, security analysis, benchmarks, restart behaviour. Reads top-to-bottom.
2. `container/Dockerfile` + `container/run-outer.sh`: the actual proven artifacts. The 4 required `run` flags are each documented inline with "what breaks if you remove it."
3. `container/seccomp-builder.json` (+ `gen-seccomp.sh`): our tailored syscall filter. Note it's stricter than just disabling seccomp; only 3 specific syscalls were unblocked.
4. `container/INNER-APP-SPIKE.md`: inner (app) container spike findings.
The rest (`smoke.sh`, `entrypoint.sh`, the `docs/plans/*` and `docs/architecture/other/*` files) is supporting context: the test harness and the multi-stage rollout plan this spike is Stage 0 of.
Not in scope (deliberately)
agent-server integration, the deploy skill, appx wiring: all later stages. This PR only answers "is the nesting safe and viable?" → yes.