Description
Problem Statement
Running openshell sandbox create -- claude on macOS Tahoe (Apple Silicon) with Docker on OrbStack results in persistent pod CrashLoopBackOff during cluster startup. The root cause is a flannel CNI initialization race combined with an overly aggressive liveness probe that kills pods before the network layer is ready. A secondary issue is that the server logs misleading "TLS handshake failed" errors from the tcpSocket liveness probe, making it appear that the probe itself is misconfigured.
Technical Context
The K3s single-node cluster starts and immediately begins scheduling pods (CoreDNS, agent-sandbox-controller, metrics-server). However, Flannel CNI, which runs as a DaemonSet inside K3s, takes ~19 seconds to initialize and write `/run/flannel/subnet.env`. All pods scheduled before that file exists fail with `failed to load flannel 'subnet.env' file: no such file or directory`. The existing entrypoint already creates the `/run/flannel` directory (acknowledging the race in a comment), but this is insufficient: the directory alone doesn't prevent the error, because the `subnet.env` file itself must exist.
Compounding this, the liveness probe on the openshell StatefulSet has initialDelaySeconds: 2 and failureThreshold: 3 with periodSeconds: 5, meaning the pod gets killed after only 17 seconds (2 + 3×5) of not responding — just before flannel becomes ready at ~19 seconds. The pod restarts, enters CrashLoopBackOff with exponential backoff, and the cycle continues.
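Based on the probe settings described above, the problematic liveness probe block presumably looks something like this (field values are taken from the statefulset.yaml reference in this issue; the actual template wraps them in Helm value lookups):

```yaml
livenessProbe:
  tcpSocket:
    port: grpc
  initialDelaySeconds: 2   # probing starts almost immediately after container start
  periodSeconds: 5         # probe every 5 seconds
  failureThreshold: 3      # restart after 3 consecutive failures: 2 + 3*5 = 17s
```

With these values the pod is killed 17 seconds after start if the port never opens, two seconds short of the ~19-second flannel initialization window.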
The reported "tls handshake eof" errors are actually expected behavior: the tcpSocket probe connects at TCP level, the server accepts the connection and attempts a TLS handshake, the probe disconnects (having confirmed the port is open), and the server logs a TLS error. The probe itself succeeds — but the error log is misleading.
Affected Components
| Component | Key Files | Role |
|---|---|---|
| Helm chart | `deploy/helm/openshell/templates/statefulset.yaml`, `deploy/helm/openshell/values.yaml` | Defines liveness/readiness probes for the openshell StatefulSet |
| Server | `crates/openshell-server/src/lib.rs`, `crates/openshell-server/src/http.rs` | TCP+TLS accept loop and health endpoints |
| Cluster bootstrap | `deploy/docker/cluster-entrypoint.sh`, `deploy/docker/Dockerfile.cluster` | K3s startup sequence and flannel directory pre-creation |
| Health check | `deploy/docker/cluster-healthcheck.sh` | Docker HEALTHCHECK for the cluster container |
| CLI bootstrap | `crates/openshell-bootstrap/src/runtime.rs` | `wait_for_gateway_ready` polling logic |
Technical Investigation
Architecture Overview
The cluster startup sequence is:
1. `cluster-entrypoint.sh` runs, creates the `/run/flannel` directory (lines 473-483), then execs K3s (line 506)
2. K3s starts the API server, kubelet, and scheduler
3. The kubelet immediately tries to schedule pending pods (CoreDNS, local-path-provisioner, openshell StatefulSet)
4. The Flannel DaemonSet pod is created but also needs the network to start (a chicken-and-egg problem, resolved by host networking)
5. Flannel initializes after ~19 seconds and writes `/run/flannel/subnet.env`
6. Pods scheduled before step 5 fail with the missing `subnet.env`
The openshell server (crates/openshell-server/src/lib.rs:167-211) listens with TLS via a TcpListener + TlsAcceptor. The TCP accept happens first, then TLS handshake in a spawned task. The tcpSocket K8s probe only checks TCP connectivity, so it succeeds at the TCP level even with TLS enabled.
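The probe-vs-TLS interaction can be reproduced with plain `std` networking: a tcpSocket probe is just a connect followed by an immediate close, so the server's accept succeeds, but the first read (where a TLS handshake would expect a ClientHello) sees EOF. A minimal, self-contained sketch of that sequence (not the actual server code, which uses an async accept loop):

```rust
use std::io::Read;
use std::net::{TcpListener, TcpStream};
use std::thread;

// Returns the number of bytes the "probe" sent before disconnecting.
fn probe_then_read() -> usize {
    // Server side: bind to an ephemeral port, like the TCP accept step in lib.rs.
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();

    // Probe side: a tcpSocket probe just connects and disconnects.
    let probe = thread::spawn(move || {
        let _conn = TcpStream::connect(addr).unwrap();
        // Dropping the stream closes it without sending any bytes.
    });

    // The accept succeeds, so the probe passes at the TCP level.
    let (mut stream, _peer) = listener.accept().unwrap();
    probe.join().unwrap();

    // A TLS handshake would now try to read a ClientHello and hit EOF instead.
    let mut buf = [0u8; 16];
    stream.read(&mut buf).unwrap() // Ok(0): the peer closed cleanly
}

fn main() {
    println!("bytes before disconnect: {}", probe_then_read());
}
```

The zero-byte read is exactly the condition the server's handshake task then reports as "TLS handshake failed", even though nothing is actually wrong.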
Code References
| Location | Description |
|---|---|
| `deploy/helm/openshell/templates/statefulset.yaml:113-119` | Liveness probe: tcpSocket on port grpc, `initialDelaySeconds: 2`, `failureThreshold: 3`, `periodSeconds: 5` |
| `deploy/helm/openshell/templates/statefulset.yaml:120-126` | Readiness probe: same tcpSocket pattern |
| `deploy/helm/openshell/values.yaml:45-55` | Default probe timing values |
| `crates/openshell-server/src/lib.rs:180-211` | TLS accept loop: accepts TCP, then attempts the TLS handshake in a spawned task |
| `crates/openshell-server/src/lib.rs:201` | `error!(error = %e, client = %addr, "TLS handshake failed")` logs probe disconnects as errors |
| `crates/openshell-server/src/http.rs:21-46` | Health endpoints (`/health`, `/healthz`, `/readyz`) |
| `deploy/docker/cluster-entrypoint.sh:473-483` | Pre-creates the `/run/flannel` directory, with a comment acknowledging the race condition |
| `deploy/docker/cluster-entrypoint.sh:506` | `exec /bin/k3s server ...` starts K3s |
| `deploy/docker/Dockerfile.cluster:271-272` | Docker HEALTHCHECK with `--start-period=20s` |
| `deploy/docker/cluster-healthcheck.sh:35-51` | Checks K8s API reachability and StatefulSet readiness |
| `crates/openshell-bootstrap/src/runtime.rs:44-226` | `wait_for_gateway_ready`: 180 attempts × 2s = 360s budget |
Current Behavior
- K3s starts; the kubelet schedules pods immediately
- Flannel is not yet ready, so all pods fail with a missing `subnet.env` and enter CrashLoopBackOff
- Flannel becomes ready at ~19s
- The liveness probe kills the openshell pod at ~17s, just before flannel is ready
- The pod restarts and enters CrashLoopBackOff with exponential backoff
- The server logs "TLS handshake failed" for every tcpSocket probe connection (noise)
- The cluster may eventually recover within the 360s CLI timeout, but startup is unreliable
What Would Need to Change
Fix 1 — Add startupProbe to StatefulSet (primary fix):
The Helm chart template needs a startupProbe added to the container spec. This is the standard Kubernetes pattern for pods with slow startup. With a startupProbe, the livenessProbe is disabled until the startup probe succeeds, preventing premature kills during the flannel initialization window.
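Following the chart's existing tcpSocket-on-grpc pattern, the added block might look like this (a sketch using the timing proposed in this issue; the real template would pull these values from values.yaml):

```yaml
startupProbe:
  tcpSocket:
    port: grpc
  periodSeconds: 2
  failureThreshold: 30   # 30 * 2s = 60s startup window before liveness takes over
```

While the startup probe has not yet succeeded, Kubernetes suppresses the liveness probe entirely, so the 17-second kill window no longer applies during flannel initialization.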
Fix 2 — Downgrade TLS handshake error log level:
In crates/openshell-server/src/lib.rs:201, the error! macro should be changed to debug! (or conditionally downgraded) when the TLS error indicates an immediate EOF — which is the tcpSocket probe pattern. This reduces misleading log noise.
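One way to scope the downgrade is to classify the handshake error before choosing a log level. This is a sketch: the concrete error type at `lib.rs:201` depends on the TLS library in use, so `is_probe_disconnect` and the use of `std::io::Error` here are assumptions, not the existing code.

```rust
use std::io;

// Hypothetical helper: treat an immediate EOF during the handshake as a
// health-probe disconnect rather than a genuine TLS failure.
fn is_probe_disconnect(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::UnexpectedEof
}

fn main() {
    let probe = io::Error::new(io::ErrorKind::UnexpectedEof, "tls handshake eof");
    let real = io::Error::new(io::ErrorKind::InvalidData, "received corrupt message");

    // In the accept loop this would select the log level, e.g.:
    //   if is_probe_disconnect(&e) { debug!(...) } else { error!(...) }
    println!("probe: {}", is_probe_disconnect(&probe)); // true
    println!("real: {}", is_probe_disconnect(&real)); // false
}
```

Keying on the error kind rather than suppressing all handshake errors preserves `error!`-level visibility for real misconfigurations (bad certificates, protocol mismatches).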
Alternative Approaches Considered
| Approach | Trade-off |
|---|---|
| Increase `initialDelaySeconds` on the liveness probe | Delays detection of real failures; doesn't decouple startup from steady state |
| Wait for flannel in the entrypoint before exec'ing k3s | Impossible: flannel runs inside k3s |
| Use `--flannel-backend=none` + external flannel | High complexity, maintenance burden |
| Switch to a gRPC health probe | More correct, but requires implementing the gRPC health service |
| Add an exec probe with `curl -k https://localhost:8080/healthz` | Validates the full stack, but requires curl in the container |
Patterns to Follow
The existing probe definitions in statefulset.yaml:113-126 follow the standard Helm chart pattern with values from values.yaml. The startupProbe should follow the same pattern, using tcpSocket on the grpc port with configurable values.
Proposed Approach
Add a startupProbe to the StatefulSet template with generous timing (e.g., periodSeconds: 2, failureThreshold: 30 = 60s startup window). This decouples slow-start tolerance from the steady-state liveness check. Additionally, downgrade the TLS handshake error log from error! to debug! when the error is an immediate EOF, reducing misleading noise. Both changes are low-risk and follow established Kubernetes and Rust logging patterns.
Scope Assessment
- Complexity: Low
- Confidence: High — clear path forward, standard Kubernetes pattern
- Estimated files to change: 3-4 (statefulset.yaml, values.yaml, lib.rs, possibly cluster-healthcheck.sh)
- Issue type: fix
Risks & Open Questions
- startupProbe failureThreshold: How long should the startup window be? 60 seconds covers flannel (19s) with generous margin, but slower environments (CI, constrained machines) may need more. Should this be configurable via Helm values?
- TLS log downgrade scope: Need to ensure we only downgrade log level for the "immediate EOF" pattern (probe behavior) and not for genuine TLS configuration errors. An EOF with zero application bytes is the signal.
- K8s version requirement: `startupProbe` has been GA since K8s 1.20; the K3s v1.35.2 seen in the logs supports it.
Test Considerations
- Manual testing: run `openshell sandbox create -- claude` on macOS with OrbStack and verify the pod starts without CrashLoopBackOff
- E2E tests: the existing `mise run e2e` suite should cover cluster startup; verify it passes with the new probe configuration
- Unit tests: the TLS log-level change can be verified by checking that a tcpSocket-style connect+disconnect produces a debug log, not an error
- Edge case: test in a resource-constrained environment where flannel takes longer than usual to initialize
Created by spike investigation. Use build-from-issue to plan and implement.