
fix: pod CrashLoopBackOff during cluster startup due to flannel race and aggressive liveness timing #409

@richardjortega

Description

Problem Statement

Running openshell sandbox create -- claude on macOS Tahoe (Apple Silicon) with Docker on OrbStack results in persistent pod CrashLoopBackOff during cluster startup. The root cause is a flannel CNI initialization race combined with an overly aggressive liveness probe that kills pods before the network layer is ready. A secondary issue is that the server logs misleading "TLS handshake failed" errors from the tcpSocket liveness probe, making it appear that the probe itself is misconfigured.

Technical Context

The K3s single-node cluster starts and immediately begins scheduling pods (CoreDNS, agent-sandbox-controller, metrics-server). However, Flannel CNI — which runs as a DaemonSet inside K3s — takes ~19 seconds to initialize and write `/run/flannel/subnet.env`. All pods scheduled before that file exists fail with `failed to load flannel 'subnet.env' file: no such file or directory`. The existing entrypoint already creates the `/run/flannel` directory (a comment there acknowledges the race), but this is insufficient: the directory alone doesn't prevent the error — the `subnet.env` file itself must exist.

Compounding this, the liveness probe on the openshell StatefulSet has initialDelaySeconds: 2 and failureThreshold: 3 with periodSeconds: 5, meaning the pod gets killed after only 17 seconds (2 + 3×5) of not responding — just before flannel becomes ready at ~19 seconds. The pod restarts, enters CrashLoopBackOff with exponential backoff, and the cycle continues.
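Laid out as a probe spec, the timing that produces the 17-second window looks roughly like this (a sketch reconstructed from the values in the Code References section; the actual chart templates these fields from values.yaml):

```yaml
# Current liveness probe (per statefulset.yaml:113-119).
# The pod is killed after initialDelaySeconds + failureThreshold * periodSeconds
# = 2 + 3 * 5 = 17s of failed probes, just before flannel is ready (~19s).
livenessProbe:
  tcpSocket:
    port: grpc
  initialDelaySeconds: 2
  periodSeconds: 5
  failureThreshold: 3
```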

The reported "tls handshake eof" errors are actually expected behavior: the tcpSocket probe connects at TCP level, the server accepts the connection and attempts a TLS handshake, the probe disconnects (having confirmed the port is open), and the server logs a TLS error. The probe itself succeeds — but the error log is misleading.

Affected Components

| Component | Key Files | Role |
| --- | --- | --- |
| Helm chart | `deploy/helm/openshell/templates/statefulset.yaml`, `deploy/helm/openshell/values.yaml` | Defines liveness/readiness probes for the openshell StatefulSet |
| Server | `crates/openshell-server/src/lib.rs`, `crates/openshell-server/src/http.rs` | TCP+TLS accept loop and health endpoints |
| Cluster bootstrap | `deploy/docker/cluster-entrypoint.sh`, `deploy/docker/Dockerfile.cluster` | K3s startup sequence and flannel directory pre-creation |
| Health check | `deploy/docker/cluster-healthcheck.sh` | Docker HEALTHCHECK for the cluster container |
| CLI bootstrap | `crates/openshell-bootstrap/src/runtime.rs` | `wait_for_gateway_ready` polling logic |

Technical Investigation

Architecture Overview

The cluster startup sequence is:

  1. cluster-entrypoint.sh runs, creates the /run/flannel directory (lines 473-483), then execs K3s (line 506)
  2. K3s starts the API server, kubelet, and scheduler
  3. Kubelet immediately tries to schedule pending pods (CoreDNS, local-path-provisioner, openshell StatefulSet)
  4. Flannel DaemonSet pod is created but also needs network to start (chicken-and-egg, resolved by host networking)
  5. Flannel initializes after ~19 seconds, writes /run/flannel/subnet.env
  6. Pods scheduled before step 5 fail with missing subnet.env

The openshell server (crates/openshell-server/src/lib.rs:167-211) listens with TLS via a TcpListener + TlsAcceptor. The TCP accept happens first, then TLS handshake in a spawned task. The tcpSocket K8s probe only checks TCP connectivity, so it succeeds at the TCP level even with TLS enabled.

Code References

| Location | Description |
| --- | --- |
| `deploy/helm/openshell/templates/statefulset.yaml:113-119` | Liveness probe: `tcpSocket` on port `grpc`, `initialDelaySeconds: 2`, `failureThreshold: 3`, `periodSeconds: 5` |
| `deploy/helm/openshell/templates/statefulset.yaml:120-126` | Readiness probe: same `tcpSocket` pattern |
| `deploy/helm/openshell/values.yaml:45-55` | Default probe timing values |
| `crates/openshell-server/src/lib.rs:180-211` | TLS accept loop — accepts TCP, then attempts the TLS handshake in a spawned task |
| `crates/openshell-server/src/lib.rs:201` | `error!(error = %e, client = %addr, "TLS handshake failed")` — logs probe disconnects as errors |
| `crates/openshell-server/src/http.rs:21-46` | Health endpoints (`/health`, `/healthz`, `/readyz`) |
| `deploy/docker/cluster-entrypoint.sh:473-483` | Pre-creates the `/run/flannel` dir with a comment acknowledging the race condition |
| `deploy/docker/cluster-entrypoint.sh:506` | `exec /bin/k3s server ...` — starts K3s |
| `deploy/docker/Dockerfile.cluster:271-272` | Docker HEALTHCHECK with `--start-period=20s` |
| `deploy/docker/cluster-healthcheck.sh:35-51` | Checks K8s API + StatefulSet readiness |
| `crates/openshell-bootstrap/src/runtime.rs:44-226` | `wait_for_gateway_ready`: 180 attempts × 2s = 360s budget |

Current Behavior

  1. K3s starts; the kubelet schedules pods immediately
  2. Flannel is not ready → all pods fail with missing subnet.env → CrashLoopBackOff
  3. The liveness probe kills the openshell pod at ~17s, just before flannel becomes ready at ~19s
  4. Flannel becomes ready at ~19s, but the pod has already been restarted
  5. The pod re-enters CrashLoopBackOff, and exponential backoff delays each retry further
  6. The server logs "TLS handshake failed" for every tcpSocket probe connection (noise)
  7. The cluster may eventually recover within the 360s CLI timeout, but startup is unreliable

What Would Need to Change

Fix 1 — Add startupProbe to StatefulSet (primary fix):
The Helm chart template needs a startupProbe added to the container spec. This is the standard Kubernetes pattern for pods with slow startup. With a startupProbe, the livenessProbe is disabled until the startup probe succeeds, preventing premature kills during the flannel initialization window.
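A minimal sketch of Fix 1, assuming the same tcpSocket/grpc pattern as the existing probes (the timing values are the proposal's suggestions, not final):

```yaml
# Hypothetical startupProbe for the openshell container spec.
# While this probe has not yet succeeded, Kubernetes suspends the
# liveness probe, so the pod survives the ~19s flannel window.
# Startup budget = periodSeconds * failureThreshold = 2 * 30 = 60s.
startupProbe:
  tcpSocket:
    port: grpc
  periodSeconds: 2
  failureThreshold: 30
```

Once the startup probe succeeds, the existing liveness probe takes over unchanged, so steady-state failure detection stays fast.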

Fix 2 — Downgrade TLS handshake error log level:
In crates/openshell-server/src/lib.rs:201, the error! macro should be changed to debug! (or conditionally downgraded) when the TLS error indicates an immediate EOF — which is the tcpSocket probe pattern. This reduces misleading log noise.
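A sketch of the classification logic for Fix 2 (the helper name is hypothetical; the real accept loop at lib.rs:180-211 would branch between `error!` and `debug!` on this predicate):

```rust
use std::io::{Error, ErrorKind};

/// Heuristic: a tcpSocket probe connects and immediately disconnects,
/// which surfaces during the TLS handshake as an unexpected EOF before
/// any application bytes arrive. Genuine TLS configuration failures
/// (bad certificate, protocol mismatch) carry a different error kind.
fn is_probe_disconnect(e: &Error) -> bool {
    e.kind() == ErrorKind::UnexpectedEof
}

fn main() {
    let probe_err = Error::new(ErrorKind::UnexpectedEof, "tls handshake eof");
    let real_err = Error::new(ErrorKind::InvalidData, "bad certificate");

    // In the accept loop this predicate would select the log level:
    // debug! for probe disconnects, error! for everything else.
    assert!(is_probe_disconnect(&probe_err));
    assert!(!is_probe_disconnect(&real_err));
    println!("probe disconnect classified correctly");
}
```

This keeps real TLS failures visible at the error level while silencing the per-probe noise, matching the "EOF with zero application bytes" signal noted under Risks & Open Questions.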

Alternative Approaches Considered

| Approach | Trade-off |
| --- | --- |
| Increase `initialDelaySeconds` on the liveness probe | Delays detection of real failures; doesn't decouple startup from steady-state |
| Wait for flannel in the entrypoint before exec'ing k3s | Impossible — flannel runs inside k3s |
| Use `--flannel-backend=none` + external flannel | High complexity, maintenance burden |
| Switch to a gRPC health probe | More correct but requires implementing the gRPC health service |
| Add an exec probe with `curl -k https://localhost:8080/healthz` | Validates the full stack but requires curl in the container |

Patterns to Follow

The existing probe definitions in statefulset.yaml:113-126 follow the standard Helm chart pattern with values from values.yaml. The startupProbe should follow the same pattern, using tcpSocket on the grpc port with configurable values.
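Following that pattern, the new probe's timings could be exposed as chart values alongside the existing liveness/readiness blocks (the key names below are illustrative, not the chart's actual schema):

```yaml
# values.yaml -- hypothetical keys mirroring the existing probe values,
# so slower environments (CI, constrained machines) can widen the window.
startupProbe:
  enabled: true
  periodSeconds: 2
  failureThreshold: 30
```

Making `failureThreshold` configurable also answers the open question about slower environments without changing the template again.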

Proposed Approach

Add a startupProbe to the StatefulSet template with generous timing (e.g., periodSeconds: 2, failureThreshold: 30 = 60s startup window). This decouples slow-start tolerance from the steady-state liveness check. Additionally, downgrade the TLS handshake error log from error! to debug! when the error is an immediate EOF, reducing misleading noise. Both changes are low-risk and follow established Kubernetes and Rust logging patterns.

Scope Assessment

  • Complexity: Low
  • Confidence: High — clear path forward, standard Kubernetes pattern
  • Estimated files to change: 3-4 (statefulset.yaml, values.yaml, lib.rs, possibly cluster-healthcheck.sh)
  • Issue type: fix

Risks & Open Questions

  • startupProbe failureThreshold: How long should the startup window be? 60 seconds covers flannel (19s) with generous margin, but slower environments (CI, constrained machines) may need more. Should this be configurable via Helm values?
  • TLS log downgrade scope: Need to ensure we only downgrade log level for the "immediate EOF" pattern (probe behavior) and not for genuine TLS configuration errors. An EOF with zero application bytes is the signal.
  • K8s version requirement: startupProbe is GA since K8s 1.20. K3s v1.35.2 in the logs supports this.

Test Considerations

  • Manual testing: Run openshell sandbox create -- claude on macOS with OrbStack and verify the pod starts without CrashLoopBackOff
  • E2E tests: The existing mise run e2e suite should cover cluster startup; verify it passes with the new probe configuration
  • Unit tests: The TLS log level change can be verified by checking that a tcpSocket-style connect+disconnect produces a debug log, not an error
  • Edge case: Test on a resource-constrained environment where flannel takes longer than usual to initialize

Created by spike investigation. Use build-from-issue to plan and implement.
