bug: Claude Code crashes on GB200 (ARM64 64K page kernel) #400

@klueska

Agent Diagnostic

Investigated by: Claude Code (debug-openshell-cluster skill)
Cluster: openshell · https://127.0.0.1:8080 · v0.0.7
Sandbox: strong-treefrog · image ghcr.io/nvidia/openshell-community/sandboxes/base@sha256:97a6a668...

Cluster is fully healthy — this is not a cluster startup issue.

openshell status        → Connected (v0.0.7)
node status             → Ready (k3s v1.35.2+k3s1)
openshell-0             → Running, 0 restarts

Root cause: the Claude Code binary aborts on ARM64 kernels with 64K page size.

Property                Value
Kernel                  6.14.0-1013-nvidia-64k
Architecture            aarch64
getconf PAGE_SIZE       65536 (64K)
/usr/local/bin/claude   ELF ARM64 standalone binary, 222MB
Bundled runtime         Bun v1.3.1 (confirmed via grep -boa "Bun v")
Crash signal            SIGABRT (exit 134, not SIGSEGV)
Seccomp                 Disabled (not the cause)
Node.js v22.22.1        Works fine in same sandbox

Bun's JavaScript engine, JavaScriptCore (JSC), performs memory layout operations that assume pages are at most 16K. On a 64K-page kernel that assumption is violated, and JSC calls abort() before any user code executes. Node.js itself is unaffected: the NPM-based install path runs on Node.js and avoids the problem entirely.
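
The "exit 134" reading in the table follows from the usual shell convention that a status above 128 means the process was killed by signal (status - 128), and 134 - 128 = 6 is SIGABRT. A minimal sketch of that decoding, where the helper name decode_exit_status is hypothetical and not part of OpenShell:

```shell
# Decode a shell exit status into the signal that terminated the process.
# Convention: a status above 128 means "killed by signal (status - 128)".
decode_exit_status() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        signum=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        printf 'signal %s (SIG%s)\n' "$signum" "$(kill -l "$signum")"
    else
        printf 'exit code %s (no signal)\n' "$status"
    fi
}

decode_exit_status 134   # the status observed for /usr/local/bin/claude
```

On Linux this reports signal 6 (SIGABRT) for status 134, which is what distinguishes a deliberate abort() from a segmentation fault (SIGSEGV, signal 11, status 139).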

Commands run:

openshell status
openshell doctor exec -- kubectl get pods -A -o wide
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- uname -a
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- getconf PAGE_SIZE
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- /usr/local/bin/claude --version
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- sh -c 'cat /proc/self/status | grep -E "Seccomp|NoNewPrivs|CapEff"'
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- ldd /usr/local/bin/claude
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- sh -c 'ls -lh /usr/local/bin/claude'
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- sh -c 'grep -boa "Bun v" /usr/local/bin/claude'
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- node -e "process.exit(0)"  # ✓ works

Description

The default Claude Code installer produces a binary that aborts at startup on GB200 nodes.

$ openshell sandbox create -- claude
...
Created sandbox: strong-treefrog
$ openshell sandbox connect strong-treefrog
sandbox@strong-treefrog:~$ claude
Aborted (core dumped)

This is a known issue with the Claude Code installer on these machines. The workaround is to install via NPM instead:

npm install -g @anthropic-ai/claude-code

OpenShell should consider using the NPM-based install path for Claude on
affected systems. It is unclear whether this is specific to Grace-based
systems (GB200) or affects ARM64 in general. It is worth testing on other
aarch64 machines, including ones running 4K-page kernels, to narrow it down.
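
One way the install-path selection could look is sketched below. This is only an illustration: the function name choose_claude_install_method is hypothetical, and the 16384-byte threshold is taken from the agent diagnostic rather than from a confirmed Bun or JavaScriptCore limit, so it should be treated as an assumption until other aarch64 machines have been tested.

```shell
# Hypothetical helper: choose an install method from architecture and page size.
# With no arguments it inspects the running system via uname and getconf.
choose_claude_install_method() {
    arch="${1:-$(uname -m)}"
    page_size="${2:-$(getconf PAGE_SIZE)}"

    case "$arch" in
        aarch64|arm64)
            # Assumption: pages larger than 16K trigger the abort (per the diagnostic).
            if [ "$page_size" -gt 16384 ]; then
                echo npm
                return 0
            fi
            ;;
    esac
    echo native
}

choose_claude_install_method aarch64 65536   # prints: npm
choose_claude_install_method aarch64 4096    # prints: native
choose_claude_install_method x86_64 4096     # prints: native
```

A sandbox image build could then run npm install -g @anthropic-ai/claude-code only when the helper prints npm, and keep the default installer otherwise.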

Reproduction Steps

  1. On a GB200 node, create a Claude sandbox: openshell sandbox create -- claude
  2. Connect to the sandbox: openshell sandbox connect <name>
  3. Run claude
  4. Observe: Aborted (core dumped)

Environment

  • Host: GB200 (Grace Blackwell), aarch64
  • OpenShell sandbox with Claude provider

Logs

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
