Description
Problem
The Navigator k3s cluster currently runs as a privileged Docker container with k3s's embedded containerd using the default runc runtime. While sandbox pods already enforce Landlock, seccomp, and network namespace isolation at the process level, the sandbox containers themselves share the host kernel via the standard OCI runtime. A stronger isolation boundary would improve the security posture for running untrusted agent code; this issue evaluates the options.
The SandboxTemplate proto already has a `runtime_class_name` field wired through to the pod spec (navigator-server/src/sandbox/mod.rs:512), so partial plumbing for alternative runtimes exists.
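For reference, that field maps onto the standard `runtimeClassName` field of the Kubernetes pod spec. A minimal sketch of what the server would render (pod and image names here are hypothetical) might look like:

```yaml
# Hypothetical rendering of a sandbox pod when its template sets
# runtime_class_name. Only runtimeClassName changes; the rest of the
# sandbox pod spec (seccomp, resources, etc.) is unaffected.
apiVersion: v1
kind: Pod
metadata:
  name: sandbox-example            # hypothetical name
spec:
  runtimeClassName: gvisor         # from SandboxTemplate.runtime_class_name
  containers:
    - name: sandbox
      image: sandbox-image:latest  # placeholder image
```

Note that the kubelet rejects pods referencing a `RuntimeClass` that does not exist, so the field is only usable once a matching `RuntimeClass` object and containerd handler are installed.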
Options to Evaluate
Option 1: gVisor (runsc) as containerd runtime (Recommended starting point)
gVisor is a userspace application kernel written in Go that intercepts all syscalls and implements them without passing them to the host kernel. It integrates with containerd via a shim and supports Kubernetes RuntimeClass natively.
Pros:
- Lightweight, fast startup, process-like resource model (no fixed guest resources)
- Battle-tested at scale (powers GKE Sandbox at Google)
- Native Kubernetes `RuntimeClass` integration -- our existing `runtime_class_name` field would work directly
- Simple integration: install the `runsc` binary in the cluster image, add containerd runtime config, create a `RuntimeClass` manifest
- Supports amd64 and arm64
- Defense-in-depth: gVisor also applies seccomp internally to its own Sentry process
Cons:
- Not full VM-level isolation (the Sentry runs as a userspace process on the host kernel, though with a very restricted syscall surface)
- Some Linux syscall compatibility gaps -- needs testing with our sandbox workloads (Python, Node.js, coding agents)
- Higher per-syscall overhead for syscall-heavy workloads
- May conflict with or be redundant alongside our existing Landlock/seccomp enforcement
Integration sketch:
- Add the `runsc` binary to `Dockerfile.cluster`
- Add containerd runtime config for the `runsc` handler
- Deploy a `RuntimeClass` named `gvisor` in the k3s manifests
- Set `runtime_class_name: "gvisor"` on sandbox templates (already supported in proto/server)
- Test sandbox workloads for compatibility
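Concretely, the containerd and Kubernetes sides of the sketch above come down to two small config artifacts (exact file paths vary by k3s version and should be verified against our cluster image; both `runsc` and `containerd-shim-runsc-v1` must be on the node's PATH):

```toml
# containerd CRI runtime entry registering gVisor's shim under the
# handler name "runsc" (containerd 1.x CRI plugin syntax)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
```

```yaml
# RuntimeClass manifest; "handler" must match the containerd runtime
# name registered above
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
```

Pods that set `runtimeClassName: gvisor` are then scheduled normally but executed by `runsc` instead of runc.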
Option 2: Firecracker microVMs via firecracker-containerd
firecracker-containerd enables containerd to run containers inside Firecracker microVMs, providing hardware-level isolation via KVM. Each container gets its own lightweight VM (~125ms boot, ~5MB overhead).
Pros:
- Strongest isolation model (hardware-level via KVM hypervisor)
- Proven at massive scale (AWS Lambda, Fargate)
- Minimal overhead per microVM
Cons:
- Requires `/dev/kvm` -- problematic for our Docker-in-Docker setup and essentially impossible on macOS (no nested KVM). Would likely require running the cluster on bare-metal Linux or in a VM with nested virtualization enabled.
- Requires a custom containerd binary (the firecracker control plugin compiles into containerd), a VM runtime shim, an in-VM agent, and a custom root filesystem image builder -- significantly more complex than gVisor
- Heavier operational burden: must build/maintain a microVM root filesystem image containing runc and the firecracker agent
- Go-only ecosystem; project maintenance cadence should be evaluated
- Does not integrate via standard Kubernetes `RuntimeClass` -- uses its own containerd plugin/API
Integration sketch:
- Replace or heavily modify the k3s embedded containerd with firecracker-containerd
- Build a microVM root filesystem with runc + firecracker agent
- Configure CNI networking (tc-redirect-tap plugin)
- Significant changes to cluster image and entrypoint
- Would not work for local macOS development without nested virtualization
Option 3: Run the entire k3s cluster inside a VM
Rather than changing the container runtime, run the entire privileged k3s container inside a lightweight VM (QEMU, Cloud Hypervisor, Lima on macOS, etc.).
Pros:
- Isolates the entire cluster, including the privileged Docker container surface
- Defense-in-depth: even a container escape stays within the VM
- No changes to the container runtime or sandbox pod configuration
Cons:
- Adds startup latency and resource overhead
- More complex lifecycle management (VM provisioning, networking, volume mounting)
- Complicates local development, especially on macOS (Lima/colima already provides a Linux VM for Docker)
- Does not improve isolation between sandbox pods (all still share the same runc runtime inside the VM)
Recommendation
Start with gVisor (Option 1) as it offers the best balance of security improvement, integration simplicity, and compatibility with our existing architecture. It works within our current k3s-in-Docker model, leverages the existing runtime_class_name plumbing, and doesn't require KVM or nested virtualization.
Option 3 (cluster-in-VM) could be pursued independently as an additional layer. Option 2 (Firecracker) is the strongest isolation but has the highest integration cost and significant platform constraints.
Acceptance Criteria
- Document findings from evaluating each option against our workloads
- Prototype gVisor integration in the k3s cluster image
- Test sandbox compatibility (Python 3.12, Node.js, coding agents) under gVisor
- Measure performance impact (sandbox startup time, steady-state overhead)
- Decision document with chosen approach and migration plan
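As a starting point for the compatibility and startup-time measurements, a minimal smoke-test pod (pod name and image tag here are illustrative) could be:

```yaml
# Smoke test: run the Python 3.12 image under gVisor and confirm the
# interpreter starts and prints its version. Inspect with:
#   kubectl logs gvisor-smoke-python
apiVersion: v1
kind: Pod
metadata:
  name: gvisor-smoke-python
spec:
  runtimeClassName: gvisor
  restartPolicy: Never
  containers:
    - name: python
      image: python:3.12-slim
      command: ["python", "-c", "import sys; print(sys.version)"]
```

A quick way to confirm a pod is actually running under gVisor rather than runc is to run `dmesg` inside it: gVisor's Sentry returns its own synthetic kernel log banner instead of the host's.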
Originally by @drew on 2026-02-10T13:44:36.529-08:00