Evaluate sandbox isolation options: gVisor runtime, Firecracker microVMs, or cluster-in-VM #4

@pimlock

Description

Problem

The Navigator k3s cluster currently runs as a privileged Docker container with k3s's embedded containerd using the default runc runtime. While sandbox pods already enforce Landlock, seccomp, and network-namespace isolation at the process level, the sandbox containers themselves still share the host kernel through the standard OCI runtime. Adopting a stronger isolation boundary would improve the security posture for running untrusted agent code.

The SandboxTemplate proto already has a runtime_class_name field wired through to the pod spec (navigator-server/src/sandbox/mod.rs:512), so partial plumbing for alternative runtimes exists.

Options to Evaluate

Option 1: gVisor (runsc) as containerd runtime (Recommended starting point)

gVisor is a userspace application kernel written in Go that intercepts all syscalls and implements them without passing them to the host kernel. It integrates with containerd via a shim and supports Kubernetes RuntimeClass natively.

Pros:

  • Lightweight, fast startup, process-like resource model (no fixed guest resources)
  • Battle-tested at scale (powers GKE Sandbox at Google)
  • Native Kubernetes RuntimeClass integration -- our existing runtime_class_name field would work directly
  • Simple integration: install runsc binary in the cluster image, add containerd runtime config, create a RuntimeClass manifest
  • Supports amd64 and arm64
  • Defense-in-depth: gVisor also applies seccomp internally to its own Sentry process

Cons:

  • Not full VM-level isolation (the Sentry runs as a userspace process on the host kernel, though with a very restricted syscall surface)
  • Some Linux syscall compatibility gaps -- needs testing with our sandbox workloads (Python, Node.js, coding agents)
  • Higher per-syscall overhead for syscall-heavy workloads
  • May conflict with or be redundant alongside our existing Landlock/seccomp enforcement

Integration sketch:

  1. Add runsc binary to Dockerfile.cluster
  2. Add containerd runtime config for runsc handler
  3. Deploy a RuntimeClass named gvisor in the k3s manifests
  4. Set runtime_class_name: "gvisor" on sandbox templates (already supported in proto/server)
  5. Test sandbox workloads for compatibility
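
Steps 2-4 amount to small config changes. As a hedged sketch (the handler name and the k3s paths are assumptions to verify against Dockerfile.cluster): the containerd side needs a `runsc` handler entry such as `runtime_type = "io.containerd.runsc.v1"` in k3s's `config.toml.tmpl`, and the Kubernetes side needs a matching RuntimeClass:

```yaml
# RuntimeClass mapping the name our sandbox templates reference via
# runtime_class_name to containerd's "runsc" handler. Placed under
# /var/lib/rancher/k3s/server/manifests/ it is auto-applied by k3s.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc   # must match the containerd runtime handler name
```

A sandbox template with runtime_class_name: "gvisor" then lands on pods as spec.runtimeClassName: gvisor, which the kubelet resolves to the runsc handler.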

Option 2: Firecracker microVMs via firecracker-containerd

firecracker-containerd enables containerd to run containers inside Firecracker microVMs, providing hardware-level isolation via KVM. Each container gets its own lightweight VM (~125ms boot, ~5MB overhead).

Pros:

  • Strongest isolation model (hardware-level via KVM hypervisor)
  • Proven at massive scale (AWS Lambda, Fargate)
  • Minimal overhead per microVM

Cons:

  • Requires /dev/kvm -- problematic for our Docker-in-Docker setup and essentially impossible on macOS (no nested KVM). Would likely require running the cluster on bare-metal Linux or in a VM with nested virtualization enabled.
  • Requires a custom containerd binary (firecracker control plugin compiles into containerd), a VM runtime shim, an in-VM agent, and a custom root filesystem image builder -- significantly more complex than gVisor
  • Heavier operational burden: must build/maintain a microVM root filesystem image containing runc and the firecracker agent
  • Go-only ecosystem; project maintenance cadence should be evaluated
  • Does not integrate via standard Kubernetes RuntimeClass -- uses its own containerd plugin/API
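
The first con can be turned into a quick preflight. This is a sketch, not project tooling: it only checks whether the host exposes a usable /dev/kvm (which our Docker-in-Docker setup would additionally need passed through, e.g. `docker run --device /dev/kvm`), and whether nested virtualization is enabled if the cluster itself runs inside a VM:

```shell
# Preflight for Option 2: Firecracker requires a readable/writable /dev/kvm.
# On macOS hosts (and most CI containers) this will report "missing".
if [ -c /dev/kvm ] && [ -r /dev/kvm ] && [ -w /dev/kvm ]; then
  kvm_status="available"
else
  kvm_status="missing"
fi
echo "KVM ${kvm_status} on this host"

# Nested virtualization (relevant if the cluster runs inside a VM) is
# exposed via the kvm module parameters on Linux:
for f in /sys/module/kvm_intel/parameters/nested /sys/module/kvm_amd/parameters/nested; do
  if [ -r "$f" ]; then
    echo "nested: $(cat "$f") ($f)"
  fi
done
```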

Integration sketch:

  1. Replace or heavily modify the k3s embedded containerd with firecracker-containerd
  2. Build a microVM root filesystem with runc + firecracker agent
  3. Configure CNI networking (tc-redirect-tap plugin)
  4. Significant changes to cluster image and entrypoint
  5. Would not work for local macOS development without nested virtualization

Option 3: Run the entire k3s cluster inside a VM

Rather than changing the container runtime, run the entire privileged k3s container inside a lightweight VM (QEMU, Cloud Hypervisor, Lima on macOS, etc.).

Pros:

  • Isolates the entire cluster, including the privileged Docker container surface
  • Defense-in-depth: even a container escape stays within the VM
  • No changes to the container runtime or sandbox pod configuration

Cons:

  • Adds startup latency and resource overhead
  • More complex lifecycle management (VM provisioning, networking, volume mounting)
  • Complicates local development, especially on macOS, where Lima/colima already provides a Linux VM for Docker and this would add or restructure a VM layer
  • Does not improve isolation between sandbox pods (all still share the same runc runtime inside the VM)

Recommendation

Start with gVisor (Option 1) as it offers the best balance of security improvement, integration simplicity, and compatibility with our existing architecture. It works within our current k3s-in-Docker model, leverages the existing runtime_class_name plumbing, and doesn't require KVM or nested virtualization.

Option 3 (cluster-in-VM) could be pursued independently as an additional layer. Option 2 (Firecracker) is the strongest isolation but has the highest integration cost and significant platform constraints.

Acceptance Criteria

  • Document findings from evaluating each option against our workloads
  • Prototype gVisor integration in the k3s cluster image
  • Test sandbox compatibility (Python 3.12, Node.js, coding agents) under gVisor
  • Measure performance impact (sandbox startup time, steady-state overhead)
  • Decision document with chosen approach and migration plan
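
For the compatibility item, one quick sanity check is that gVisor implements its own kernel log: `dmesg` inside a sandboxed pod prints the gVisor Sentry banner rather than host kernel messages. A hypothetical smoke-test pod (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gvisor-smoke
spec:
  runtimeClassName: gvisor   # the RuntimeClass deployed in the integration sketch
  restartPolicy: Never
  containers:
    - name: check
      image: busybox
      command: ["dmesg"]     # under gVisor this prints the "Starting gVisor..." banner
```

`kubectl logs gvisor-smoke` showing the gVisor banner confirms the pod did not silently fall back to runc.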

Originally by @drew on 2026-02-10T13:44:36.529-08:00
