Description
Agent Diagnostic
Investigated by: Claude Code (debug-openshell-cluster skill)
Cluster: openshell · https://127.0.0.1:8080 · v0.0.7
Sandbox: strong-treefrog · image ghcr.io/nvidia/openshell-community/sandboxes/base@sha256:97a6a668...
Cluster is fully healthy — this is not a cluster startup issue.
openshell status → Connected (v0.0.7)
node status → Ready (k3s v1.35.2+k3s1)
openshell-0 → Running, 0 restarts
Root cause: the Claude Code binary aborts on ARM64 kernels with 64K page size.
| Property | Value |
|---|---|
| Kernel | 6.14.0-1013-nvidia-64k |
| Architecture | aarch64 |
| `getconf PAGE_SIZE` | 65536 (64K) |
| `/usr/local/bin/claude` | ELF ARM64 standalone binary, 222MB |
| Bundled runtime | Bun v1.3.1 (confirmed via grep -boa "Bun v") |
| Crash signal | SIGABRT (exit 134, not SIGSEGV) |
| Seccomp | Disabled — not the cause |
| Node.js v22.22.1 | Works fine in same sandbox |
Bun's JavaScript engine (JavaScriptCore) performs memory layout operations assuming ≤16K page alignment. On a 64K page kernel, these assumptions are violated and JSC calls abort() before any user code executes. Node.js is unaffected, so the NPM-based install path, which runs on Node.js, avoids the problem entirely.
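The page-size condition described above can be checked before launching the binary. This is a minimal sketch, not part of OpenShell: the helper name `page_size_exceeds_16k` is hypothetical, and the 16K threshold is taken from the diagnosis in this issue, not from Bun or JavaScriptCore documentation.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the given page size in bytes
# is larger than 16 KiB, the limit this diagnostic attributes to JSC.
page_size_exceeds_16k() {
    [ "$1" -gt 16384 ]
}

# Report for the current kernel; getconf PAGE_SIZE prints the size in bytes.
current_page_size=$(getconf PAGE_SIZE)
if page_size_exceeds_16k "$current_page_size"; then
    echo "page size ${current_page_size}: Bun-based claude binary may abort"
else
    echo "page size ${current_page_size}: within the assumed 16K limit"
fi
```

On the sandbox in this report, `getconf PAGE_SIZE` returns 65536, so the first branch would be taken.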
Commands run:
openshell status
openshell doctor exec -- kubectl get pods -A -o wide
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- uname -a
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- getconf PAGE_SIZE
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- /usr/local/bin/claude --version
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- sh -c 'cat /proc/self/status | grep -E "Seccomp|NoNewPrivs|CapEff"'
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- ldd /usr/local/bin/claude
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- sh -c 'ls -lh /usr/local/bin/claude'
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- sh -c 'grep -boa "Bun v" /usr/local/bin/claude'
openshell doctor exec -- kubectl -n openshell exec strong-treefrog -- node -e "process.exit(0)" # ✓ works
Description
The default Claude Code installer produces a binary that aborts at startup on GB200 nodes.
$ openshell sandbox create -- claude
...
Created sandbox: strong-treefrog
$ openshell sandbox connect strong-treefrog
sandbox@strong-treefrog:~$ claude
Aborted (core dumped)
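The `Aborted (core dumped)` message matches the exit status 134 recorded in the diagnostic table. POSIX shells such as bash and dash report a process killed by a signal as 128 plus the signal number, so the signal can be recovered by subtraction. A small sketch of that arithmetic follows; the helper name `signal_from_status` is hypothetical, and the signal names in the comments are the Linux numbering.

```shell
#!/bin/sh
# Hypothetical helper: convert a shell exit status into a signal number.
# Prints 0 when the status does not indicate death by signal.
signal_from_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0
    fi
}

signal_from_status 134   # prints 6  (SIGABRT on Linux)
signal_from_status 139   # prints 11 (SIGSEGV on Linux)
```

In the sandbox, running `claude; echo $?` would show the status to feed into this helper.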
This is a known issue with the Claude Code installer on these machines. The workaround is to install via NPM instead:
npm install -g @anthropic-ai/claude-code
OpenShell should consider using the NPM-based install path for Claude on
affected systems. It is unclear whether this is specific to Grace-based
systems (GB200) or affects ARM64 in general — worth testing on other
aarch64 machines to narrow it down.
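One way such a fallback could work, without hard-coding a list of affected hardware, is to probe the native binary and switch to NPM only when the probe aborts. This is an illustrative sketch, not OpenShell's implementation: `pick_install_path` is a hypothetical helper, and it only prints a decision without installing anything.

```shell
#!/bin/sh
# Hypothetical probe: run the given command and print which install path
# to prefer. Exit status 134 (128 + SIGABRT) is treated as the abort
# described in this issue; any other status keeps the native path.
pick_install_path() {
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 134 ]; then
        echo "npm"       # native binary aborted; prefer the NPM install
    else
        echo "native"    # no abort observed; keep the default installer
    fi
}
```

A caller might run `pick_install_path /usr/local/bin/claude --version` and, on `npm`, run `npm install -g @anthropic-ai/claude-code`. Note that a missing binary (status 127) also yields `native` in this sketch, so a real implementation would likely want to handle that case separately.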
Reproduction Steps
- On a GB200 node, create a Claude sandbox: `openshell sandbox create -- claude`
- Connect to the sandbox: `openshell sandbox connect <name>`
- Run `claude`
- Observe: `Aborted (core dumped)`
Environment
- Host: GB200 (Grace Blackwell), aarch64
- OpenShell sandbox with Claude provider
Logs
Agent-First Checklist
- I pointed my agent at the repo and had it investigate this issue
- I loaded relevant skills (e.g., `debug-openshell-cluster`, `debug-inference`, `openshell-cli`)
- My agent could not resolve this; the diagnostic above explains why