
Containerise - run spikes on outer and inner containers #4

Merged
neuromaxer merged 16 commits into main from containerise-p1 on Jun 12, 2026

neuromaxer commented Jun 11, 2026

Contributor

Stage 0 spike: proving rootless Podman works inside an unprivileged Docker container

What this is

A spike (investigation, not production code) validating the core assumption behind our containerised-apps architecture: that we can run a Docker "outer" container that itself runs Podman to build and serve user apps — without giving it root or privileged access to the host.

Result: it works, securely and fast. This PR captures the proven recipe + a detailed writeup so we can build Stage 2 (agent-server in the container) with confidence.

Nothing here ships yet — it's all under container/ (throwaway test rig) and docs/.

TL;DR of what we found

  • ✅ Full chain works: host → Docker → rootless Podman → inner app, reachable via published port
  • No --privileged, no dangerous capabilities, runs as a non-root user
  • Zero host changes needed beyond "install Docker" — Ubuntu's strict security defaults stay on
  • ✅ Fast: native overlay storage (~2× faster than the fallback), sub-second startup
  • ✅ Restart recovery works (apps come back after a container restart)

The real blocker turned out not to be AppArmor (what we expected) but a subtle setuid packaging issue with Ubuntu's newuidmap helper — fixed in 3 Dockerfile lines (the same trick the official Podman image uses). Full root-cause trace is in the findings doc.

What to review (in priority order)

  1. container/SPIKE-FINDINGS.md — start here. The complete writeup: the headline fix, every Docker flag justified, security analysis, benchmarks, restart behaviour. Reads top-to-bottom.
  2. container/Dockerfile + container/run-outer.sh — the actual proven artifacts. The 4 required run flags are each documented inline with "what breaks if you remove it."
  3. container/seccomp-builder.json (+ gen-seccomp.sh) — our tailored syscall filter. Note it's stricter than just disabling seccomp — only 3 specific syscalls were unblocked.
  4. container/INNER-APP-SPIKE.md - inner (app) container spike findings

The rest (smoke.sh, entrypoint.sh, the docs/plans/* and docs/architecture/other/* files) is supporting context — the test harness and the multi-stage rollout plan this spike is Stage 0 of.

Not in scope (deliberately)

agent-server integration, the deploy skill, appx wiring — all later stages. This PR only answers "is the nesting safe and viable?" → yes.

neuromaxer and others added 12 commits June 11, 2026 17:44
…check)

Root cause traced with bpftrace: Ubuntu's setuid-root newuidmap runs euid=0
inside the unprivileged outer container, which bypasses the kernel uid_map
ownership shortcut and demands CAP_SYS_ADMIN in the init userns (denied by
docker's cap bounding set). Switching to file caps (cap_setuid+ep) keeps euid
1000 == nested-userns owner, matching Fedora/podman-stable. Also pre-create
volume mountpoints owned by builder so named volumes don't mount root-owned.
- add --device /dev/net/tun (slirp4netns rootless networking needs it)
- clear default_sysctls in containers.conf (crun couldn't write read-only
  /proc/sys/net/ipv4/ping_group_range under the nested netns)
- add --security-opt systempaths=unconfined (docker masks /proc submounts;
  kernel mount_too_revealing() then blocks the inner container's fresh proc
  mount with 'mount proc to proc: Operation not permitted'). No privilege/caps.
- entrypoint wipes stale XDG_RUNTIME_DIR pause-process state on each boot so
  podman survives docker restart
…ers inner containers after docker restart (Stage 4 recovery mechanism)
Native overlay works on kernel 7.0 and is ~2.2x faster than fuse-overlayfs
(582ms vs 1281ms on a 300-file build), and removes the need for /dev/fuse.
fuse-overlayfs left installed as documented fallback only.
Deletion-tested all four security flags; each is required with a distinct
failure (recorded in findings). Replaced seccomp=unconfined with a tailored
profile (container/seccomp-builder.json, gen-seccomp.sh documents provenance):
podman's stock profile + ungating only sethostname/setdomainname/setns, while
keeping bpf/perf_event_open/quotactl/fanotify_init/lookup_dcookie denied.
Strictly tighter than unconfined. Smoke still 11/11.
- Tested podman as outer runtime (installed podman 5.7 on host, recorded):
  rootless podman-outer fails (rootless-in-rootless subuid exhaustion at
  newuidmap); rootful podman-outer works with a SMALLER flag set (no seccomp
  override - podman default allows mount; no systempaths) but needs root +
  DNS config. docker-outer remains the proven path.
- Filled in all findings sections: flag table with exact deletion errors,
  host-prereqs (none beyond docker), host-changes log (all sysctls restored
  to hardened defaults), T4 storage, T3 warmup+restart, port chain, Stage 2
  recommendations.
- Final clean-slate smoke (after docker system prune -af): 11/11 PASS, exit 0.

Canvinus left a comment

Contributor

Review — Stage 0 nested-podman spike

I reviewed this by independently reproducing the spike on a clean host, not just reading the diff. I stood up a throwaway Hetzner Ubuntu 24.04 box (kernel 6.8, AppArmor on, apparmor_restrict_unprivileged_userns=1) — the exact production target the findings doc says it couldn't use — plus an arm64 Ubuntu 24.04 VM for cross-arch coverage. Installed Docker 29.5.3 on both and ran the full chain.

Verdict

The spike is sound and its conclusion holds. The unprivileged docker → rootless-podman → inner-app chain works on the real production OS, on both x86_64 and arm64, with zero host changes and the hardened userns sysctl left at 1. Nothing ships from this PR (everything is under container/ + docs/), so merge risk is ~zero. Recommend merge after the one should-fix below (a one-line gen-seccomp.sh correctness bug); everything else is forward-looking and can ride into Stage 2.

What I verified (independent reproduction)

| Check | x86_64 / 24.04 (Hetzner) | arm64 / 24.04 (OrbStack) |
| --- | --- | --- |
| smoke.sh full chain | 11/11 | 11/11 |
| apparmor=unconfined actually required | ✓ deletion test reproduces the exact overlay-mount denial | n/a (OrbStack kernel has no AppArmor LSM) |
| Hardened userns sysctl stays =1 through green runs | | n/a |
| uid isolation: inner "root" → host uid 1000, worker → 100100 | | |

This closes the doc's own open item ("re-verify on a genuine Ubuntu 24.04 host before production") — done, it passes.
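To make the uid isolation row concrete, here is a minimal sketch of how a user-namespace uid_map translates an inner uid to the uid seen one level out. The map entries are assumptions chosen to be consistent with the numbers reported above: builder uid 1000 comes from the PR, while the subuid range starting at 100000 and an inner worker uid of 101 are hypothetical. It also assumes the outer Docker container does no uid remapping of its own, so the outer uid equals the host uid.

```python
def to_outer_uid(uid_map: list[tuple[int, int, int]], inner_uid: int) -> int:
    """Translate a uid through a uid_map given as (inside_start, outside_start, count) rows."""
    for inside, outside, count in uid_map:
        if inside <= inner_uid < inside + count:
            return outside + (inner_uid - inside)
    raise ValueError(f"uid {inner_uid} is unmapped in this namespace")


# Hypothetical rootless-podman map: inner root is the invoking user (1000),
# every other inner uid comes out of an assumed subuid range at 100000.
ROOTLESS_MAP = [(0, 1000, 1), (1, 100000, 65536)]

print(to_outer_uid(ROOTLESS_MAP, 0))    # inner "root"
print(to_outer_uid(ROOTLESS_MAP, 101))  # assumed inner worker uid
```

The rows use the same three-column layout as /proc/<pid>/uid_map, so the real map can be pasted in to check the figures on an actual host.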

Strengths

  • The headline newuidmap root-cause (Dockerfile L31–33) is correct and matches upstream quay.io/podman/stable (rpm --setcaps shadow-utils). It was bpftrace-traced, not guessed.
  • Every docker run flag is deletion-tested with the exact failure recorded — I reproduced all four independently.
  • Storage-on-a-volume is load-bearing and correctly chosen: without it, podman's overlay store lands on the container's own overlay rootfs → overlay-on-overlay, which the kernel forbids (torvalds/linux@76bc8e2). I confirmed the broken fuse-overlayfs fallback empirically.

Findings by severity

🔴 Should-fix (1): gen-seccomp.sh leaves a contradictory ALLOW+ERRNO rule pair for the same three syscalls (inline comment, with one-line fix).

🟡 Recommend for Stage 2 (forward-looking, all tested)

  • Add --init — I found live zombies (slirp4netns, conmon) in the running container (inline comment has the evidence).
  • Add resource limits + restart policy — container currently runs uncapped with RestartPolicy=no (inline comment).
  • Replace apparmor=unconfined with a tailored profile — I built and validated one (inline comment has the full profile; passes the chain in enforce mode while still denying /proc/sysrq-trigger & /proc/kcore). Closes the deferred TODO at run-outer.sh L23.
  • Optional: drop capabilities to a minimal set — verified the chain works with --cap-drop=ALL + 11 caps vs. the full default 14.

⚪ Nits

  • Duplicate "Result summary" heading + leftover template stub in SPIKE-FINDINGS.md (inline).
  • Docs land in docs/plans/ while main now uses docs/superpowers/plans/ (post-refactor convention).

Combined-hardening proof

I ran all the recommendations together in one container — fixed seccomp + tailored AppArmor (enforcing) + --cap-drop=ALL+minimal caps + --init + --memory=3g --cpus=2 --pids-limit=2048 + --restart=unless-stopped — and re-ran the full chain incl. restart recovery:

COMBINED HARDENING: 15 passed, 0 failed
  pid1 is docker-init (reaper) · apparmor podman-builder (enforce) · CapBnd=880404fb
  podman info/run/build/port-chain · zombie count = 0 · sysrq-trigger+kcore DENIED
  workspace+image-store survive restart · podman start recovers inner app

So none of the hardening conflicts with the working setup — it can be adopted wholesale in Stage 2.
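The CapBnd=880404fb value in the output above can be decoded to see which 11 capabilities remain. The sketch below maps each bit to its name, using the numbering from linux/capability.h. The comparison mask a80425fb is an assumption: it is the bounding set commonly reported for Docker's 14 default capabilities, not a value taken from this PR.

```python
# Capability names indexed by bit number (linux/capability.h ordering).
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask: str) -> list[str]:
    """Return the capability names whose bits are set in a CapBnd-style hex mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


hardened = decode_caps("880404fb")   # value reported in the combined run
default = decode_caps("a80425fb")    # ASSUMED Docker default bounding set
dropped = sorted(set(default) - set(hardened))

print(len(hardened), hardened)
print("dropped vs. assumed default:", dropped)
```

Under that assumption the hardened set differs from the default only by SETPCAP, NET_RAW and AUDIT_WRITE, and SYS_ADMIN is absent from both.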

Best-practice cross-check (sources)

Confirmed the recipe against: Dan Walsh's "Podman inside a container", the upstream containers/image_build podman image, the moby docker-default AppArmor template, the kernel no_new_privs & overlayfs docs, capabilities(7), and podman v5.0 release notes. All consistent with the spike. One forward note: podman 5.x switches the rootless default from slirp4netns to pasta (both still need /dev/net/tun), so a base-image bump warrants a re-smoke.

Comment thread container/gen-seccomp.sh
for s in d['syscalls']:
    inc=s.get('includes',{})
    if s['action']=='SCMP_ACT_ALLOW' and inc.get('caps')==['CAP_SYS_ADMIN']:
        s['names']=[n for n in s['names'] if n in NEED]

Contributor

🔴 Should-fix — a contradictory seccomp rule survives generation.

This loop rewrites only the SCMP_ACT_ALLOW rule, but podman's stock profile also ships a complementary SCMP_ACT_ERRNO rule (excludes.caps=[CAP_SYS_ADMIN]) that also lists sethostname/setdomainname/setns. After this script runs, seccomp-builder.json contains both an ALLOW rule and an ERRNO rule naming those three syscalls. I confirmed this in the generated file — they appear in two separate rules.

Which rule wins on conflicting input is libseccomp/runtime-version-defined, not guaranteed. It happens to resolve ALLOW-wins on Docker 29.5.3 (I verified the chain works on both x86_64 and arm64), but a runtime/libseccomp bump could flip it and break inner-container setup with a confusing Operation not permitted. Since this script's whole job is provenance/correctness, worth fixing.

One-line fix — also strip the names from the deny rule:

for s in d['syscalls']:
    inc=s.get('includes',{})
    if s['action']=='SCMP_ACT_ALLOW' and inc.get('caps')==['CAP_SYS_ADMIN']:
        s['names']=[n for n in s['names'] if n in NEED]
        s.pop('includes',None)
    # also remove them from the complementary ERRNO rule so they aren't both allowed and denied
    if s['action']=='SCMP_ACT_ERRNO' and s.get('excludes',{}).get('caps')==['CAP_SYS_ADMIN']:
        s['names']=[n for n in s['names'] if n not in NEED]

Verified on the box: with this applied, the full nested chain still passes (inner podman run exercises exactly these three syscalls during namespace setup).
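As a way to guard against this regressing, the fix can be exercised on a toy profile. The sketch below is not the real gen-seccomp.sh: the STOCK profile is a heavily abridged, hypothetical stand-in for podman's stock profile, and the names ungate and conflicts are illustrative helpers. It applies the same two rewrites and then checks that no syscall ends up both unconditionally allowed and denied.

```python
import copy

NEED = {"sethostname", "setdomainname", "setns"}

# Hypothetical, abridged stand-in for the stock profile's complementary rule pair.
STOCK = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
        {"names": ["mount", "sethostname", "setdomainname", "setns"],
         "action": "SCMP_ACT_ALLOW", "includes": {"caps": ["CAP_SYS_ADMIN"]}},
        {"names": ["mount", "sethostname", "setdomainname", "setns"],
         "action": "SCMP_ACT_ERRNO", "excludes": {"caps": ["CAP_SYS_ADMIN"]}},
    ],
}


def ungate(profile: dict, fix_deny: bool = True) -> dict:
    """Ungate the NEED syscalls; with fix_deny, also strip them from the deny rule."""
    out = copy.deepcopy(profile)
    for rule in out["syscalls"]:
        caps_in = rule.get("includes", {}).get("caps")
        caps_ex = rule.get("excludes", {}).get("caps")
        if rule["action"] == "SCMP_ACT_ALLOW" and caps_in == ["CAP_SYS_ADMIN"]:
            rule["names"] = [n for n in rule["names"] if n in NEED]
            rule.pop("includes", None)
        elif fix_deny and rule["action"] == "SCMP_ACT_ERRNO" and caps_ex == ["CAP_SYS_ADMIN"]:
            rule["names"] = [n for n in rule["names"] if n not in NEED]
    return out


def conflicts(profile: dict) -> set[str]:
    """Syscalls named by both an unconditional ALLOW rule and any ERRNO rule."""
    allowed, denied = set(), set()
    for rule in profile["syscalls"]:
        if rule["action"] == "SCMP_ACT_ALLOW" and "includes" not in rule:
            allowed.update(rule["names"])
        elif rule["action"] == "SCMP_ACT_ERRNO":
            denied.update(rule["names"])
    return allowed & denied


print(sorted(conflicts(ungate(STOCK, fix_deny=False))))  # the contradictory pair
print(sorted(conflicts(ungate(STOCK))))                  # empty once the deny rule is stripped
```

A check like conflicts() could run against the generated seccomp-builder.json in smoke.sh, so a future change to the generator cannot silently reintroduce the overlap.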

Contributor Author

fixed in 572acf0

Comment thread container/run-outer.sh
docker build -t "$IMAGE" .
docker rm -f "$NAME" 2>/dev/null || true

docker run -d --name "$NAME" \

Contributor

🟡 Recommend (Stage 2): add --init, resource limits, and a restart policy.

Zombies (real, found in the running container). With CMD ["sleep","infinity"] (Dockerfile L77) as PID 1, the live container accumulates zombies:

$ docker exec builder-outer ps -eo pid,stat,comm | awk '$2 ~ /Z/'
65 Z    slirp4netns
68 Zs   conmon

PID 1 is sleep, which reaps nothing and catches no signals (/proc/1/status shows SigCgt=0, so docker stop always hits the 10 s SIGKILL timeout). podman detaches a conmon per inner container that reparents to PID 1. docker run --init (tini) fixes both: PID 1 becomes docker-init and the zombie count drops to 0 (tested). This matters more at Stage 2: Node-as-PID-1 (agent-server) has the same non-reaping problem.

Limits + restart. docker inspect shows Memory=0, NanoCpus=0, PidsLimit=null, RestartPolicy=no — a runaway inner build/fork starves the host and every other project, and the outer container won't return after a host reboot. Suggest:

--init --memory=3g --cpus=2 --pids-limit=2048 --restart=unless-stopped

The plan defers these to Stage 4, but they're free to add now and I verified they don't disturb the chain (combined run = 15/15).
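The two checks behind the zombie finding could be automated in the smoke harness. The following is a rough sketch, not part of the PR: zombies and catches are hypothetical helper names. It parses `ps -eo pid,stat,comm` output for processes in state Z and tests whether a SigCgt mask from /proc/<pid>/status has a handler bit set for a given signal.

```python
import signal


def zombies(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) for every process whose state code starts with Z."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[0].isdigit() and fields[1].startswith("Z"):
            found.append((int(fields[0]), fields[2].strip()))
    return found


def catches(sigcgt_hex: str, signum: int) -> bool:
    """True if bit (signum - 1) is set in a SigCgt mask, i.e. a handler is installed."""
    return bool((int(sigcgt_hex, 16) >> (signum - 1)) & 1)


# Sample modelled on the ps output quoted above, plus a header and PID 1.
sample = """\
  PID STAT COMMAND
    1 Ss   sleep
   65 Z    slirp4netns
   68 Zs   conmon
"""

print(zombies(sample))
print(catches("0000000000000000", signal.SIGTERM))  # sleep as PID 1: no handler
```

Feeding it the real output of `docker exec builder-outer ps -eo pid,stat,comm` before and after adding --init would turn the "zombie count = 0" claim into a repeatable assertion.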

Contributor Author

will do in next stage

Comment thread container/run-outer.sh
docker run -d --name "$NAME" \
    --device /dev/net/tun \
    --security-opt seccomp="$SECCOMP" \
    --security-opt apparmor=unconfined \

Contributor

🟡 The deferred AppArmor TODO on this line is solvable now — I built and tested the profile.

The only rule in docker-default that blocks this workload is `deny mount,`. A tailored profile is docker-default verbatim with that one line swapped for `mount,` plus `pivot_root,`, keeping every other protection. Loaded on the Hetzner box in enforce mode:

AppArmor on pid1: podman-builder (enforce)
info: ok   build: ok   run+fwd: ok
/proc/sysrq-trigger write: DENIED ✓
/proc/kcore read:          DENIED ✓

Drop-in container/apparmor-podman-builder (load with apparmor_parser -r -W, run with --security-opt apparmor=podman-builder):

#include <tunables/global>
profile podman-builder flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  network, capability, file, umount,
  signal (receive) peer=unconfined,
  signal (receive) peer=runc,
  signal (receive) peer=crun,
  signal (send,receive) peer=podman-builder,
  deny @{PROC}/* w,
  deny @{PROC}/sys/[^k]** w,
  deny @{PROC}/sys/kernel/{?,??,[^s][^h][^m]**} w,
  deny @{PROC}/sysrq-trigger rwklx,
  deny @{PROC}/kcore rwklx,
  mount,            # docker-default has `deny mount,` here — the ONLY relaxation
  pivot_root,
  deny /sys/[^f]*/** wklx,
  deny /sys/firmware/** rwklx,
  deny /sys/kernel/security/** rwklx,
  ptrace (trace,read,tracedby,readby) peer=podman-builder,
}

Strictly tighter than unconfined, and it protects the credential-bearing outer layer — worth doing before Stage 2 ships agent-server into this container. (Caveat already noted in the doc: this is Ubuntu-host-specific; the profile name must be loaded on the host, so it belongs with appx's Stage-3 host provisioning.)

Comment thread container/Dockerfile
# Fedora/quay.io/podman-stable instead ship them with file capabilities, so
# euid stays 1000 (== owner of the nested userns) and the ownership shortcut
# applies. Replicate that here:
RUN chmod u-s /usr/bin/newuidmap /usr/bin/newgidmap \

Contributor

This is the linchpin and it's correct — matches how quay.io/podman/stable ships these helpers (rpm --setcaps shadow-utils). Confirmed on a real 24.04 host: getcap shows cap_setuid=ep / cap_setgid=ep, and the uid-map step works with apparmor_restrict_unprivileged_userns=1 left at its hardened default.

⚠️ Worth a one-line warning comment here: this approach is incompatible with --security-opt no-new-privileges. Under no_new_privs, execve cannot add file capabilities to the permitted set (kernel no_new_privs doc), so newuidmap would silently lose CAP_SETUID and rootless podman would fail at namespace setup. Someone will eventually add that flag as "hardening" — a comment here will save them a baffling outage.
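Beyond a comment, the incompatibility could be caught mechanically. The sketch below is a hypothetical pre-flight check (has_no_new_privileges is not an existing helper in this repo) that scans a docker run argument list for the no-new-privileges security option. It assumes the option may be written bare, with `=true`, or with `:true`, and treats an explicit `false` value as disabled.

```python
def has_no_new_privileges(docker_args: list[str]) -> bool:
    """True if the argument list enables --security-opt no-new-privileges."""
    opts = []
    it = iter(docker_args)
    for arg in it:
        if arg == "--security-opt":
            opts.append(next(it, ""))          # value is the following argument
        elif arg.startswith("--security-opt="):
            opts.append(arg.split("=", 1)[1])  # value is attached with '='
    for opt in opts:
        # Normalise "key:value" to "key=value" before splitting (assumed syntaxes).
        key, _, value = opt.replace(":", "=", 1).partition("=")
        if key == "no-new-privileges" and value.lower() != "false":
            return True
    return False


spike_args = ["run", "-d", "--name", "builder-outer",
              "--device", "/dev/net/tun",
              "--security-opt", "apparmor=unconfined"]

print(has_no_new_privileges(spike_args))
print(has_no_new_privileges(spike_args + ["--security-opt", "no-new-privileges"]))
```

A guard like this in run-outer.sh's successor would fail fast with a clear message, instead of letting newuidmap lose its file capabilities at namespace setup.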

Contributor Author

added comment

Comment thread container/Dockerfile
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
# Spike keeps the container alive for `docker exec` iteration; Stage 2 replaces
# this with the agent-server process.
CMD ["sleep", "infinity"]

Contributor

Note for Stage 2: when this CMD is replaced by the agent-server process, Node-as-PID-1 has the same non-reaping / no-signal-handling problem as sleep infinity (see the zombies I found, commented on run-outer.sh L20). Plan to run under --init, or use catatonit/tini as the entrypoint wrapper.

Optional, from the upstream podman image: ENV BUILDAH_ISOLATION=chroot makes nested podman build use chroot isolation (which is effectively what's happening in this unprivileged-nested setup anyway) and avoids a needless inner-userns attempt.

Contributor Author

for next stage

Comment thread container/SPIKE-FINDINGS.md Outdated
Remarkably, **no host-level sysctl/apparmor change was required** — the hardened Ubuntu
defaults (`apparmor_restrict_unprivileged_userns=1`) are left untouched.

## Result summary

Contributor

⚪ Nit: duplicate ## Result summary heading (the real one is at L22) with a leftover template stub right below it:

## Result summary
<!-- One paragraph: does the unprivileged nested chain work on this host? -->

This second heading + the HTML comment can be deleted.

Contributor Author

fixed in 572acf0

alexanderkreidich left a comment

Reviewing this from the perspective of #3 (the production builder-container implementation in docker/builder/, currently open), since both PRs solve the same problem — rootless podman inside an unprivileged Docker container — independently.

Overall

The spike delivers real value and I'd merge it: SPIKE-FINDINGS.md with the root-cause trace of the newuidmap setuid packaging issue, the deletion-tested flag set, and the seccomp profile with provenance (gen-seccomp.sh) are exactly the knowledge that should live in the repo rather than in a PR description. No file conflicts with #3 (different directories, zero deletions).

Findings here that #3 should adopt (proposed follow-up)

This spike proves that two things #3 currently ships can be tightened — #3 even lists seccomp hardening as a named follow-up, and this PR is that follow-up's content:

  1. newuidmap via file capabilities instead of setuid → drops --cap-add SYS_ADMIN entirely. #3 claims SYS_ADMIN is required for the uid/gid maps; this spike shows it isn't once the file-cap fix (3 Dockerfile lines) is in place.
  2. Tailored seccomp profile (seccomp-builder.json = stock podman profile + 3 ungated syscalls) instead of seccomp=unconfined.
  3. Native overlay storage instead of fuse-overlayfs — ~2× faster per the benchmarks here, and removes the need for --device /dev/fuse.

Suggested sequencing: merge both PRs (order doesn't matter), then a follow-up on top of #3 that ports items 1–3 into docker/builder/run.sh + Dockerfile, re-runs docker/builder/verify.sh with the new flags, and deletes the throwaway container/ rig — keeping SPIKE-FINDINGS.md, moved under docs/. Otherwise the repo ends up with two Dockerfiles and two run scripts for the same thing, and it won't be obvious which one is canonical.

One architectural discrepancy to settle before the follow-up

Port publishing differs between the two PRs, and it's a real design decision, not a detail:

  • This PR: -p 127.0.0.1:10000-10009 — loopback-only, assumes appx proxies app traffic in.
  • #3: -p 4001 + -p 3000-3010 — published to the host directly (and verified end-to-end that way, including from outside the VM).

Who terminates app traffic (appx proxy vs. direct host publishing) should be decided once and recorded in docs/architecture/important/builder-container-architecture.md, rather than left as a difference between two run scripts.

@alexanderkreidich


Proposal: lifecycle for container/ after merge (follow-up to my review above)

To avoid ending up with two Dockerfiles and two run scripts for the same thing in the repo (docker/builder/ from #3 and container/ from this PR), I propose we agree on this sequence:

  1. Merge both PRs as-is: there are zero git conflicts (verified with a test merge; the branches touch disjoint paths). This PR lands as the Stage 0 record; #3 (Outer builder container: agent-server + rootless podman image) lands as the production implementation.

  2. One hardening follow-up PR on top of docker/builder/ that ports the three proven findings from this spike into the canonical rig:

    • newuidmap via file capabilities → drop --cap-add SYS_ADMIN
    • seccomp-builder.json + gen-seccomp.sh → replace seccomp=unconfined
    • native overlay storage → replace fuse-overlayfs and drop --device /dev/fuse

    Acceptance: docker/builder/verify.sh passes with the new flags.

  3. In that same follow-up, delete container/ — by then the rig has done its job (hypothesis proven, all useful pieces live in the production code), and keeping it around invites someone to extend the wrong Dockerfile. SPIKE-FINDINGS.md is kept, moved under docs/, since it documents why the production flags are what they are.

Until the follow-up lands, it would help to add a one-line banner at the top of SPIKE-FINDINGS.md (or a container/README) saying the canonical contract is docker/builder/ and container/ is a Stage 0 artifact scheduled for removal — so nothing new gets built on top of it in the meantime.

The only decision this plan deliberately does not make is the port-publishing question (loopback + appx proxy here vs. direct host publishing in #3) — that one needs an explicit call recorded in docs/architecture/important/builder-container-architecture.md.

neuromaxer commented Jun 12, 2026

Contributor Author

Re this:

> Who terminates app traffic (appx proxy vs. direct host publishing) should be decided once and recorded in docs/architecture/important/builder-container-architecture.md, rather than left as a difference between two run scripts.

In my mind, appx terminates the traffic and proxies it in, @alexanderkreidich.

UPD: documented that in 572acf0

- gen-seccomp.sh: strip sethostname/setdomainname/setns from the stock
  profile's complementary SCMP_ACT_ERRNO rule so they aren't both ALLOWed
  and denied; regenerate seccomp-builder.json (smoke.sh 11/11)
- Dockerfile: warn that the newuidmap file-cap fix is incompatible with
  --security-opt no-new-privileges
- SPIKE-FINDINGS.md: drop duplicate 'Result summary' heading + template stub
- builder-container-architecture.md: record decision that appx terminates
  and proxies app traffic (outer container is loopback-only)
- move docs/superpowers/plans/* -> docs/plans/, delete superpowers dir
neuromaxer changed the title from "Containerise" to "Containerise - run spikes on outer and inner containers" Jun 12, 2026
neuromaxer merged commit fd31bb7 into main Jun 12, 2026
1 check passed
3 participants