docs: BEP-1051 Kata Containers Agent Backend (#10520)
Conversation
Propose KataAgent as a third container backend (alongside DockerAgent and KubernetesAgent) with VFIO GPU passthrough for hardware-enforced workload isolation in multi-tenant GPU environments. Segmented BEP with 7 sub-documents covering:

- KataAgent/KataKernel/KataKernelCreationContext implementation
- Configuration and deployment requirements
- Storage compatibility via virtio-fs (no new storage interface)
- Calico CNI integration for multi-host networking and session isolation
- CUDAVFIOPlugin for VFIO-based whole-GPU passthrough
- Scheduler integration with AgentRow.backend column
- Migration and backward compatibility (all changes additive)
Expand storage-compatibility.md with full evaluation of all 16+ intrinsic mount categories against Kata's VM-based isolation model. Categorize each as KEEP/CHANGE/SKIP/DIFFERENT with rationale based on guest kernel vs shared kernel differences.
…arks Replace the placeholder I/O performance table with concrete benchmark data from Red Hat, Kata Containers, and Proxmox testing. Cover sequential/random throughput, DAX caveats (thrashing), host-side overhead (virtiofsd memory/CPU), and AI/ML workload impact assessment. Change DAX recommendation from "enable by default" to "disable by default" based on DAX thrashing data (Kata #2138).
- configuration-deployment.md: Add "Guest VM Image vs Container Image" section clarifying the two-layer architecture (VM rootfs is mini-OS, container image flows through standard containerd pull)
- kata-agent-backend.md: Clarify scratch dirs and resource files are still needed — VM boot disk is read-only shared mini-OS, resource files communicate metadata to kernel runner (not enforcement)
- networking.md: Add "Sandbox Model: Multiple Containers Per VM" section explaining kata-agent multi-container management within a single VM and mapping to Backend.AI cluster mode
Add new section to storage-compatibility.md evaluating conventional VM-style per-VM disk cloning (qcow2 CoW, devicemapper, EROFS) as an alternative to the current virtio-fs-for-everything approach. Recommends a hybrid model: block devices for read-only infrastructure (container image, krunner, Python libs) and virtio-fs only for bidirectional data exchange (scratch dirs, vfolders), reducing virtiofsd count from 20-30+ to 2+N_vfolders per VM.
Rewrote kata-agent-backend.md and storage-compatibility.md based on source code analysis of the three-package architecture (agent/kernel/runner). Key corrections:

- Agent socket (agent.sock) is skipped entirely for Kata, not replaced with TCP — it is only used by C binaries (jail pid translation) which are not relevant in a VM environment
- resource.txt is agent-recovery-only, never read by kernel runner — environ.txt and intrinsic-ports.json are the guest-side config files
- ZMQ PUSH/PULL channel is already TCP-based, works across VM boundary without modification
- entrypoint.sh needs Kata variant to skip LD_PRELOAD/libbaihook and jail references
- krunner binaries strategy: virtio-fs in Phase 1, bake into guest rootfs in Phase 2

Added three-package boot sequence diagram and config file reference table to kata-agent-backend.md. Resolved open questions #7 and #8.
- Fix containerd API: add Sandbox API (sandbox.v1) for multi-container sessions; containers.v1/tasks.v1 alone create one VM per container
- Fix virtiofsd process model: one process per sandbox, not per mount; reframe hybrid storage motivation from memory to I/O performance
- Clarify VM overhead: 15-60MB is VMM process only, not total per-VM
- Fix boot sequence order: get_intrinsic_mounts before prepare_scratch
- Fix Dragonball boot time: mark as estimated (no published benchmark)
- Fix Mount class location: agent/resources.py:843, not common/types.py
- Fix mount_vfolders location: inherited from AbstractKernelCreationContext
- Fix Calico standalone policy type from k8s to calico
- Add template syntax note for network policy YAML
- Resolve Open Question #1 (containerd API) to Decision Log
- Add multi-GPU context to Open Question #3 (NVIDIA GPU Operator limit)
…mise, add GPUDirect RDMA and MNNVL analysis
Pull request overview
Adds a new segmented BEP (BEP-1051) proposing a Kata Containers–based agent backend for Backend.AI, including VFIO GPU passthrough, Calico-based networking, storage strategy, and scheduler/DB integration, and registers it in the proposals index.
Changes:
- Register BEP-1051 in proposals/README.md.
- Add main BEP-1051 document plus 7 detailed sub-documents covering backend design, config/deployment, storage, networking, VFIO plugin, scheduler integration, and migration.
- Document CoCo/TEE, VFIO/GPUDirect RDMA, and additive rollout strategy.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| proposals/README.md | Adds BEP-1051 to the BEP registry table. |
| proposals/BEP-1051-kata-containers-agent.md | Main BEP with motivation, design overview, phased plan, and decision log. |
| proposals/BEP-1051/configuration-deployment.md | Proposes [kata] config schema and host prerequisites. |
| proposals/BEP-1051/kata-agent-backend.md | Details KataAgent/KataKernel/KataKernelCreationContext design and lifecycle. |
| proposals/BEP-1051/storage-compatibility.md | Analyzes mounts/storage paths and performance considerations. |
| proposals/BEP-1051/networking.md | Proposes Calico CNI integration and session isolation approach. |
| proposals/BEP-1051/vfio-accelerator-plugin.md | Proposes a VFIO-based CUDA accelerator plugin design. |
| proposals/BEP-1051/scheduler-integration.md | Proposes manager/scheduler schema + agent metadata changes for backend tracking. |
| proposals/BEP-1051/migration-compatibility.md | Outlines additive rollout, backward compatibility, and rollback plan. |
> ## Summary
>
> Backend.AI's existing `Mount` abstraction (bind mount with source path, target path, and permission) works unchanged for KataAgent. The Kata runtime shim automatically translates host-side bind mounts into virtio-fs shares exposed to the guest VM — no new storage management interface is required. This document details the compatibility layer, identifies mounts that require Kata-specific handling, and analyzes the I/O performance implications.

**Review comment:** This doc is internally inconsistent about the VFolder data path: the key decisions say VFolder data does NOT use virtio-fs (direct guest-side mounts), but the Summary states that Kata translates host bind mounts into virtio-fs shares generally. Please reconcile by explicitly separating (a) scratch/config mounts that are host→guest via virtio-fs and (b) VFolder mounts that must be guest-internal bind mounts over guest-mounted NFS/Lustre/WekaFS, and update the described flow accordingly.

**Suggested change:**

> Backend.AI's existing `Mount` abstraction (bind mount with source path, target path, and permission) works unchanged for KataAgent. For VFolder-backed data, the Storage Proxy resolves paths on network filesystems (for example, NFS/Lustre/WekaFS) that are mounted directly inside the guest VM, and the agent performs bind mounts entirely within the guest; these VFolder data mounts do not use virtio-fs. Separately, scratch/config and other non-performance-critical host-originated directories that must cross the VM boundary are provided via virtio-fs shares configured by the Kata runtime shim. No new storage management interface is required; only the mapping of mount sources to either guest-side network filesystems or virtio-fs shares changes. This document details the compatibility layer, identifies mounts that require Kata-specific handling, and analyzes the I/O performance implications.
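The scratch/config-vs-VFolder split called for here can be sketched as a routing step in the agent. This is a minimal illustration only: `MountClass`, `classify_mount`, and the `/vfroot` prefix are hypothetical names, not actual Backend.AI APIs.

```python
from enum import Enum, auto
from pathlib import PurePosixPath


class MountClass(Enum):
    VIRTIO_FS = auto()  # host→guest share: scratch, config, timezone files
    GUEST_NFS = auto()  # guest-internal bind over guest-mounted NFS/Lustre/WekaFS


# Illustrative only: host prefixes under which VFolder data is resolved.
VFOLDER_PREFIXES = (PurePosixPath("/vfroot"),)


def classify_mount(source: PurePosixPath) -> MountClass:
    """Route a mount source to virtio-fs or a guest-internal bind mount."""
    for prefix in VFOLDER_PREFIXES:
        if source.is_relative_to(prefix):
            return MountClass.GUEST_NFS
    return MountClass.VIRTIO_FS


# Scratch/config crosses the VM boundary via virtio-fs ...
assert classify_mount(PurePosixPath("/var/cache/scratch/sess-1")) is MountClass.VIRTIO_FS
# ... while VFolder data stays on the guest-mounted network filesystem.
assert classify_mount(PurePosixPath("/vfroot/team-a/dataset")) is MountClass.GUEST_NFS
```

In a real implementation the classification would come from mount metadata supplied by the Storage Proxy rather than path-prefix matching.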
> `/home/work` is the user's persistent workspace and vfolder mount point. Bidirectional sync is inherent to virtio-fs — agent writes are immediately visible in the guest, and user writes are immediately visible on the host.
>
> **Timezone files** (`/etc/localtime`, `/etc/timezone`): Docker containers need these because they share the host kernel but not its timezone configuration files. Kata VMs also need them — the guest rootfs ships with UTC as default, but the container should match the host's timezone. Sharing via virtio-fs overrides the guest default.
>
> **VFolder mounts** (`/home/work/{vfolder}`): User storage mounts pass through virtio-fs transparently. Same `Mount(BIND, host_path, target, permission)` spec. See [Direct Storage Access](#direct-storage-access-from-guest-vms) for why virtio-fs is preferred over direct guest mount.

**Review comment:** The section describing VFolder mounts says they pass through virtio-fs and references the "Direct Storage Access" section for why virtio-fs is preferred, but later in this same document you state VFolder data does NOT use virtio-fs due to RDMA breakage. Please update this VFolder subsection (and any cross-references) to match the direct guest-side mount architecture.

**Suggested change:**

> `/home/work` is the user's persistent workspace and VFolder mount point. In the Kata design this path is provided by a **direct guest-side mount** of the underlying storage system (for example, NFS or Lustre) into the VM, so both the host-side VFolder manager and the guest see the same backing storage without going through virtio-fs.
>
> **Timezone files** (`/etc/localtime`, `/etc/timezone`): Docker containers need these because they share the host kernel but not its timezone configuration files. Kata VMs also need them — the guest rootfs ships with UTC as default, but the container should match the host's timezone. Sharing via virtio-fs overrides the guest default.
>
> **VFolder mounts** (`/home/work/{vfolder}`): User storage is **not** exposed via virtio-fs, because that breaks RDMA and advanced filesystem features on NFS/Lustre. Instead, the storage backend is mounted natively inside the guest using the NFS/Lustre kernel client, and containers bind-mount VFolder subdirectories from this guest path. The same `Mount(BIND, host_path, target, permission)` spec is used at the container level, but `host_path` now refers to the guest-visible `/home/work/...` path. See [Direct Storage Access](#direct-storage-access-from-guest-vms) for why direct guest mounts are preferred over virtio-fs for VFolder data.
> **Use virtio-fs for all storage backends.** The double-hop overhead is acceptable because:
>
> - virtio-fs + DAX provides 90-98% native read performance — the dominant I/O pattern for AI/ML workloads (reading training data)
> - The host page cache is shared with the guest via DAX, effectively giving the guest "free" caching
> - The architecture remains simple — one storage path for all backends, no guest-side storage configuration
> - No vendor offers built-in Kata/Firecracker integration that would justify a separate code path
>
> Direct guest mount (NFS or native client) could be revisited as a future optimization if specific workloads demonstrate that the virtio-fs write overhead (2-5x for random writes with DAX) is a bottleneck. This would require:
>
> - Custom guest kernel with NFS/client modules
> - A new mount type or annotation to signal "direct guest mount" for specific vfolders
> - Guest-side mount credential provisioning via kata-agent
>
> This is **not proposed for the initial implementation**.

**Review comment:** The "Recommendation" currently concludes "Use virtio-fs for all storage backends", which contradicts the key decision at the top of this document (and the later implementation notes) that VFolder data must be mounted directly inside the guest to preserve RDMA paths. Please revise this recommendation section to align with the direct guest-side mount approach (and scope virtio-fs to scratch/config only).

**Suggested change:**

> **Use direct guest-side mounts for VFolder data; reserve virtio-fs for scratch/config-only directories.** This aligns with the key decision that RDMA-capable storage paths must be preserved end-to-end:
>
> - VFolder data is mounted **inside the guest VM** via the native NFS/Lustre (or other vendor) client, so RDMA and storage-specific semantics are preserved.
> - Containers see VFolder subdirectories via ordinary bind mounts from the guest filesystem — the same model we use with Docker today.
> - virtio-fs is retained only for non-performance-critical, host-originated paths (e.g., scratch space, config directories), where the double-hop overhead is acceptable and simplifies host/guest sharing.
>
> This implies that a "virtio-fs for all storage backends" architecture is **not compatible** with the RDMA requirement for VFolders and is therefore **not part of the initial implementation**. A unified virtio-fs-only mode could be reconsidered in environments that do not require RDMA or direct client semantics, but it is explicitly out of scope for BEP-1051.
```python
async def prepare_krunner_env(self, local_config):
    # Kata approach: krunner binaries are shared into the guest via
    # virtio-fs from a host directory, or baked into the guest rootfs.
    # No Docker volumes needed.
    return await prepare_krunner_env_kata(local_config)
```

**Review comment:** This section says krunner binaries are shared into the guest via virtio-fs from a host directory. The master BEP states CoCo-by-default and requires executables to be baked into the attested guest rootfs (host is untrusted). Please align this doc with the CoCo trust model (remove/avoid host→guest executable sharing and describe the attested delivery mechanism).
> ### Pydantic Config Model

```python
class KataConfig(BaseConfigSchema):
    hypervisor: Literal["cloud-hypervisor", "qemu", "dragonball"] = "cloud-hypervisor"

    # VM defaults
    default_vcpus: int = 2
    default_memory_mb: int = 2048
    vm_overhead_mb: int = 64  # VMM process + guest kernel + kata-agent (on top of guest memory)

    # Guest image
    kernel_path: Path = Path("/opt/kata/share/kata-containers/vmlinux.container")
    initrd_path: Path | None = None
    rootfs_path: Path = Path("/opt/kata/share/kata-containers/kata-containers.img")

    # Storage
    shared_fs: Literal["virtio-fs", "virtio-9p"] = "virtio-fs"
    virtiofsd_path: Path = Path("/opt/kata/libexec/virtiofsd")
    virtio_fs_cache_size: int = 0

    # Networking
    network_model: Literal["tcfilter", "macvtap"] = "tcfilter"

    # VFIO
    enable_iommu: bool = True
    hotplug_vfio: Literal["root-port", "bridge-port", "no-port"] = "root-port"

    # Containerd
    containerd_socket: Path = Path("/run/containerd/containerd.sock")
    kata_runtime_class: str = "kata"

    # Confidential computing (Phase 4)
    confidential_guest: bool = False
    guest_attestation: Literal["tdx", "sev-snp", ""] = ""
```

**Review comment:** The TOML examples use hyphenated keys (e.g., default-vcpus, vm-overhead-mb), but the Pydantic model snippet doesn't show the AliasChoices/serialization_alias pattern used elsewhere in AgentUnifiedConfig to support those hyphenated names. Please update the snippet to include the appropriate Field validation_alias/serialization_alias so the example keys actually parse and sample generation emits consistent key names.
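For illustration, a minimal sketch of the alias pattern the comment refers to, assuming pydantic v2 (`AliasChoices` / `serialization_alias`); only two fields from the snippet are shown, and `KataConfigAliased` is a stand-in name, not the actual Backend.AI class:

```python
from pydantic import AliasChoices, BaseModel, Field


class KataConfigAliased(BaseModel):
    # Accept both "default-vcpus" (TOML style) and "default_vcpus" on input;
    # emit the hyphenated form when serializing sample configs.
    default_vcpus: int = Field(
        default=2,
        validation_alias=AliasChoices("default-vcpus", "default_vcpus"),
        serialization_alias="default-vcpus",
    )
    vm_overhead_mb: int = Field(
        default=64,
        validation_alias=AliasChoices("vm-overhead-mb", "vm_overhead_mb"),
        serialization_alias="vm-overhead-mb",
    )


# Hyphenated TOML keys parse, and sample generation round-trips them.
cfg = KataConfigAliased.model_validate({"default-vcpus": 4, "vm-overhead-mb": 96})
assert cfg.default_vcpus == 4
assert cfg.model_dump(by_alias=True) == {"default-vcpus": 4, "vm-overhead-mb": 96}
```

The same `Field(...)` arguments would be applied to every hyphenated key in the `[kata]` section above.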
> - The guest VM has the `nvidia` driver loaded and NVML is functional inside the guest
> - Metrics must be collected via the kata-agent running inside the guest, forwarded over VSOCK
> - The agent can periodically query guest-side `nvidia-smi` or NVML via a metrics endpoint exposed by kata-agent

```python
async def gather_container_measures(
    self, stat_ctx, container_ids,
) -> Sequence[ContainerMeasurement]:
    # Query guest-side nvidia-smi via containerd exec
    for container_id in container_ids:
        result = await self._containerd.exec_in_container(
            container_id,
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
        )
        # Parse CSV output into ContainerMeasurement objects
```

**Review comment:** This plugin doc proposes collecting container-level GPU metrics by running nvidia-smi via containerd exec in the guest. The master BEP states CoCo-by-default and that ExecProcessRequest is blocked by the kata-agent policy, requiring metrics via in-guest exporters (DCGM/Node Exporter) over the network instead. Please update this section so the metrics plan is compatible with the CoCo policy constraints.

**Suggested change:**

> - The guest VM has the `nvidia` driver loaded and NVML is functional inside the guest.
> - Metrics MUST be exposed by in-guest exporters (e.g., DCGM exporter, Node Exporter) running inside the VM and scraped over the network (e.g., Prometheus HTTP endpoints).
> - CoCo-by-default: kata-agent `ExecProcessRequest` is blocked by policy, so the plugin MUST NOT rely on `containerd exec` or any kata-agent exec-based mechanism to run `nvidia-smi` inside the guest.
> - The host-side plugin is responsible for scraping the in-guest metrics endpoints (via the appropriate network path for the VM) and aggregating them per container or per sandbox using labels exported by the in-guest agent (e.g., `pod`, `container`, `vm_id`).

```python
async def gather_container_measures(
    self, stat_ctx, container_ids,
) -> Sequence[ContainerMeasurement]:
    """
    Collect GPU metrics for containers by scraping in-guest exporters.
    The confidential VM runs DCGM exporter / Node Exporter and exposes
    a metrics endpoint (e.g., Prometheus text format) over the network.
    This function MUST NOT attempt to exec `nvidia-smi` inside the guest,
    because kata-agent ExecProcessRequest is blocked by CoCo policy.
    """
    measurements: list[ContainerMeasurement] = []
    # Example: resolve the per-VM metrics endpoint from stat_ctx
    vm_endpoint = stat_ctx.get_vm_metrics_endpoint()
    # Scrape once per VM, then attribute metrics to containers by label
    metrics = await self._metrics_client.scrape(vm_endpoint)
    for container_id in container_ids:
        # Look up GPU metrics for this container from the scraped data,
        # using container/pod labels exported by the in-guest agent.
        gpu_stats = self._extract_gpu_stats_for_container(metrics, container_id)
        if gpu_stats is None:
            continue
        measurements.append(
            ContainerMeasurement(
                container_id=container_id,
                gpu_utilization=gpu_stats.utilization,
                gpu_memory_used=gpu_stats.mem_used,
                gpu_memory_total=gpu_stats.mem_total,
            )
        )
    return measurements
```
> ## Summary
>
> KataAgent is the third `AbstractAgent` implementation that manages containers inside lightweight VMs via Kata Containers 3.x. It communicates with containerd's gRPC API to create containers using the Kata runtime shim, replacing Docker API calls with containerd CRI operations.

**Review comment:** The Summary says KataAgent uses containerd's gRPC API but also calls this "containerd CRI operations". Later in this document you note CRI is Kubernetes/kubelet-specific and shouldn't be used directly. Please reword the summary to refer to the containerd client API (containers/tasks/sandbox services) rather than CRI to avoid confusion.

**Suggested change:**

> KataAgent is the third `AbstractAgent` implementation that manages containers inside lightweight VMs via Kata Containers 3.x. It communicates with containerd's gRPC API to create containers using the Kata runtime shim, replacing Docker API calls with containerd's native client API (containers, tasks, and sandbox services).
```python
# Shared from host via virtio-fs as individual file mounts.
for binary_name, target_path in KRUNNER_BINARY_MAP.items():
    resource_spec.mounts.append(
        Mount(MountTypes.BIND, self._krunner_dir / binary_name,
              Path(f"/opt/kernel/{target_path}"), MountPermission.READ_ONLY)
    )
```

**Review comment:** The mount_krunner() example still describes binding individual krunner binaries from a host directory via virtio-fs. Under the CoCo-by-default trust model described in the master BEP, the host must not be a source of executables. Please update this example to reflect binaries being present in the attested guest rootfs (and limit virtio-fs sharing to non-executable config/data).

**Suggested change:**

```python
# All krunner executables are provided by the attested guest rootfs.
# Under the CoCo-by-default trust model, the host must not be a source
# of executables; do not bind krunner binaries from host via virtio-fs.
# Only share non-executable config/data via virtio-fs (handled elsewhere);
# do not introduce new executable mounts from the host here.
```
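The "no host-sourced executables" rule lends itself to a defensive check at mount-assembly time. The sketch below is illustrative only: `MountSpec`, `validate_coco_mount`, and the prefix lists are hypothetical names and paths, not actual Backend.AI APIs or policy.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class MountSpec(NamedTuple):
    host_source: PurePosixPath
    guest_target: PurePosixPath


# Illustrative allow-list: host paths that may cross into the guest at all
# under the CoCo model (non-executable config/data only).
SHAREABLE_DATA_PREFIXES = (
    PurePosixPath("/etc/localtime"),
    PurePosixPath("/var/cache/scratch"),
)

# Guest paths reserved for attested-rootfs executables; the host must
# never supply content here.
EXECUTABLE_TARGET_PREFIXES = (PurePosixPath("/opt/kernel"),)


def validate_coco_mount(mount: MountSpec) -> bool:
    """Reject host→guest mounts that would land in executable locations."""
    for prefix in EXECUTABLE_TARGET_PREFIXES:
        if mount.guest_target.is_relative_to(prefix):
            return False
    return any(
        mount.host_source.is_relative_to(p) for p in SHAREABLE_DATA_PREFIXES
    )


# A krunner binary bound from the host is rejected ...
assert not validate_coco_mount(
    MountSpec(PurePosixPath("/opt/krunner/su-exec"), PurePosixPath("/opt/kernel/su-exec"))
)
# ... while non-executable config data is allowed through virtio-fs.
assert validate_coco_mount(
    MountSpec(PurePosixPath("/etc/localtime"), PurePosixPath("/etc/localtime"))
)
```

A production policy would of course live in the kata-agent policy engine rather than a Python allow-list; the point is that the host-side agent can fail fast before handing a forbidden mount to the shim.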
> ### Kata Configuration Section

```toml
[kata]
# --- Hypervisor ---
hypervisor = "cloud-hypervisor"  # "cloud-hypervisor" | "qemu" | "dragonball"

# --- VM Defaults ---
default-vcpus = 2         # Initial vCPUs per VM (hot-plugged as needed)
default-memory-mb = 2048  # Initial memory per VM in MB
vm-overhead-mb = 64       # Per-VM VMM process + guest kernel + kata-agent overhead (MB), deducted from host capacity

# --- Guest VM Image (NOT the container image — see note below) ---
kernel-path = "/opt/kata/share/kata-containers/vmlinux.container"
initrd-path = ""  # Empty = use rootfs image instead of initrd
rootfs-path = "/opt/kata/share/kata-containers/kata-containers.img"

# --- Storage ---
shared-fs = "virtio-fs"  # "virtio-fs" | "virtio-9p" (9p deprecated)
virtiofsd-path = "/opt/kata/libexec/virtiofsd"
virtio-fs-cache-size = 0  # DAX window in MB; 0 = disabled (recommended default)

# --- Networking ---
network-model = "tcfilter"  # "tcfilter" | "macvtap"

# --- VFIO ---
enable-iommu = true
hotplug-vfio = "root-port"  # "root-port" | "bridge-port" | "no-port"

# --- Containerd ---
containerd-socket = "/run/containerd/containerd.sock"
kata-runtime-class = "kata"  # RuntimeClass name registered in containerd

# --- Confidential Computing (Phase 4) ---
confidential-guest = false
guest-attestation = ""  # "tdx" | "sev-snp" | ""
```

> ### Guest VM Image vs Container Image
>
> Kata Containers uses **two separate filesystem layers** that must not be confused:
>
> 1. **Guest VM rootfs** (`rootfs-path` above): A minimal mini-OS image containing only the kata-agent, systemd, and essential utilities. This is the VM's boot disk — shared across all VMs, read-only, and mounted via DAX on a `/dev/pmem*` device inside the guest. It is **not** the user's container image. This is an infrastructure-level asset analogous to a VM template.
>
> 2. **Container image** (e.g., `cr.backend.ai/stable/python-tensorflow:2.15-py312-cuda12.3`): The user-selected OCI image that Backend.AI's image management system resolves. containerd pulls this on the host (same as Docker), and the Kata shim mounts it into the guest via virtio-fs or block device passthrough. The kata-agent inside the guest uses it as the container's root filesystem.

```
Host: containerd pulls OCI image (e.g., tensorflow:latest)
  │
  ├─ Kata shim detects image storage backend:
  │   ├─ overlayfs snapshotter → share via virtio-fs
  │   └─ devicemapper snapshotter → attach as virtio-blk block device
  │
  └─ Guest VM boots from kata-containers.img (mini-OS)
      └─ kata-agent mounts container rootfs inside the guest
          └─ Container process runs on the OCI image filesystem
```

> **No changes to Backend.AI's image management are needed.** The image registry, image selection, and containerd pull flow are identical to Docker. Only the last mile differs — Kata transports the image into the guest VM instead of using a host-kernel bind mount.
>
> For **confidential computing** (Phase 4), images are pulled and decrypted **inside the guest** using `image-rs` (the host is untrusted and must not see image contents). This is not the standard flow and requires additional CoCo components.

**Review comment:** This doc is written as if non-CoCo Kata is the baseline (host-side image pull, confidential-guest=false, CoCo as "Phase 4"), but the master BEP states CoCo-by-default and guest-side image pull (host must not see image contents). Please reconcile the configuration defaults and the "Guest VM Image vs Container Image" flow so the baseline matches the CoCo trust model (or clearly split the doc into CoCo vs non-CoCo modes).
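The snapshotter-to-transport branch in the diagram above can be restated as a tiny decision function. This is a didactic sketch only: the real Kata shim keys off the snapshotter's mount type at runtime, not a hard-coded name table, and `choose_image_transport` is a hypothetical name.

```python
def choose_image_transport(snapshotter: str) -> str:
    """Illustrative restatement of the shim's decision in the diagram above:
    block-backed snapshotters attach the image as a virtio-blk device,
    while directory-backed snapshotters are shared via virtio-fs."""
    block_backed = {"devicemapper"}
    return "virtio-blk" if snapshotter in block_backed else "virtio-fs"


assert choose_image_transport("overlayfs") == "virtio-fs"
assert choose_image_transport("devicemapper") == "virtio-blk"
```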
| [Configuration & Deployment](BEP-1051/configuration-deployment.md) | `[kata]` config section, hypervisor selection, host requirements | 1 |
| [KataAgent Backend](BEP-1051/kata-agent-backend.md) | KataAgent, KataKernel, KataKernelCreationContext | 1 |
| [Storage Compatibility](BEP-1051/storage-compatibility.md) | virtio-fs mount translation, lxcfs/socket exceptions, I/O analysis | 1 |

**Review comment:** The main doc's index describes the storage sub-document as "virtio-fs mount translation…", but this same BEP's key constraints/design overview emphasize direct guest-side NFS/Lustre/WekaFS mounts for VFolder data (virtio-fs only for scratch/config). Please update the index description so it matches the current storage architecture and avoids sending readers to an outdated premise.

**Suggested change:**

| [Storage Compatibility](BEP-1051/storage-compatibility.md) | direct guest-side NFS/Lustre/WekaFS mounts for VFolders, virtio-fs for scratch/config, lxcfs/socket exceptions, I/O analysis | 1 |
… mounts, and DCGM metrics decisions
…all sub-documents
…-fs; vfolder mounts are guest-internal only
…d Docker mounts into concise list
Summary

KataAgent: a third `AbstractAgent` backend (alongside DockerAgent and KubernetesAgent). Supersedes #9462 (renamed BEP-1049 → BEP-1051).

Documents

- BEP-1051-kata-containers-agent.md
- BEP-1051/kata-agent-backend.md
- BEP-1051/configuration-deployment.md: `[kata]` config section, hypervisor selection, host requirements
- BEP-1051/storage-compatibility.md
- BEP-1051/networking.md
- BEP-1051/vfio-accelerator-plugin.md
- BEP-1051/scheduler-integration.md
- BEP-1051/migration-compatibility.md

Status

Work in progress — open for early review and feedback.
📚 Documentation preview 📚: https://sorna--10520.org.readthedocs.build/en/10520/
📚 Documentation preview 📚: https://sorna-ko--10520.org.readthedocs.build/ko/10520/