
fix: pod CrashLoopBackOff during cluster startup due to flannel race and aggressive liveness timing #409

@richardjortega

Description

Problem Statement

Running openshell sandbox create -- claude on macOS Tahoe (Apple Silicon) with Docker on OrbStack results in persistent pod CrashLoopBackOff during cluster startup. The root cause is a flannel CNI initialization race combined with an overly aggressive liveness probe that kills pods before the network layer is ready. A secondary issue is that the server logs misleading "TLS handshake failed" errors from the tcpSocket liveness probe, making it appear that the probe itself is misconfigured.

Technical Context

The K3s single-node cluster starts and immediately begins scheduling pods (CoreDNS, agent-sandbox-controller, metrics-server). However, Flannel CNI — which runs as a DaemonSet inside K3s — takes ~19 seconds to initialize and write `/run/flannel/subnet.env`. All pods scheduled before that file exists fail with `failed to load flannel 'subnet.env' file: no such file or directory`. The existing entrypoint already creates the `/run/flannel` directory (a comment there acknowledges the race), but this is insufficient: the directory alone doesn't prevent the error — the `subnet.env` file itself must exist.

Compounding this, the liveness probe on the openshell StatefulSet has initialDelaySeconds: 2 and failureThreshold: 3 with periodSeconds: 5, meaning the pod gets killed after only 17 seconds (2 + 3×5) of not responding — just before flannel becomes ready at ~19 seconds. The pod restarts, enters CrashLoopBackOff with exponential backoff, and the cycle continues.
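Laid out as a probe spec, the timing that produces the 17-second window looks roughly like this (a sketch reconstructed from the values in the Code References section; the actual chart templates these fields from values.yaml):

```yaml
# Current liveness probe (per statefulset.yaml:113-119).
# The pod is killed after initialDelaySeconds + failureThreshold * periodSeconds
# = 2 + 3 * 5 = 17s of failed probes, just before flannel is ready (~19s).
livenessProbe:
  tcpSocket:
    port: grpc
  initialDelaySeconds: 2
  periodSeconds: 5
  failureThreshold: 3
```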

The reported "tls handshake eof" errors are actually expected behavior: the tcpSocket probe connects at TCP level, the server accepts the connection and attempts a TLS handshake, the probe disconnects (having confirmed the port is open), and the server logs a TLS error. The probe itself succeeds — but the error log is misleading.

Affected Components

| Component | Key Files | Role |
| --- | --- | --- |
| Helm chart | `deploy/helm/openshell/templates/statefulset.yaml`, `deploy/helm/openshell/values.yaml` | Defines liveness/readiness probes for the openshell StatefulSet |
| Server | `crates/openshell-server/src/lib.rs`, `crates/openshell-server/src/http.rs` | TCP+TLS accept loop and health endpoints |
| Cluster bootstrap | `deploy/docker/cluster-entrypoint.sh`, `deploy/docker/Dockerfile.cluster` | K3s startup sequence and flannel directory pre-creation |
| Health check | `deploy/docker/cluster-healthcheck.sh` | Docker HEALTHCHECK for the cluster container |
| CLI bootstrap | `crates/openshell-bootstrap/src/runtime.rs` | `wait_for_gateway_ready` polling logic |

Technical Investigation

Architecture Overview

The cluster startup sequence is:

  1. cluster-entrypoint.sh runs, creates the /run/flannel directory (lines 473-483), then execs K3s (line 506)
  2. K3s starts the API server, kubelet, and scheduler
  3. Kubelet immediately tries to schedule pending pods (CoreDNS, local-path-provisioner, openshell StatefulSet)
  4. Flannel DaemonSet pod is created but also needs network to start (chicken-and-egg, resolved by host networking)
  5. Flannel initializes after ~19 seconds, writes /run/flannel/subnet.env
  6. Pods scheduled before step 5 fail with missing subnet.env

The openshell server (crates/openshell-server/src/lib.rs:167-211) listens with TLS via a TcpListener + TlsAcceptor. The TCP accept happens first, then TLS handshake in a spawned task. The tcpSocket K8s probe only checks TCP connectivity, so it succeeds at the TCP level even with TLS enabled.

Code References

| Location | Description |
| --- | --- |
| `deploy/helm/openshell/templates/statefulset.yaml:113-119` | Liveness probe: `tcpSocket` on port `grpc`, `initialDelaySeconds: 2`, `failureThreshold: 3`, `periodSeconds: 5` |
| `deploy/helm/openshell/templates/statefulset.yaml:120-126` | Readiness probe: same `tcpSocket` pattern |
| `deploy/helm/openshell/values.yaml:45-55` | Default probe timing values |
| `crates/openshell-server/src/lib.rs:180-211` | TLS accept loop — accepts TCP, then attempts the TLS handshake in a spawned task |
| `crates/openshell-server/src/lib.rs:201` | `error!(error = %e, client = %addr, "TLS handshake failed")` — logs probe disconnects as errors |
| `crates/openshell-server/src/http.rs:21-46` | Health endpoints (`/health`, `/healthz`, `/readyz`) |
| `deploy/docker/cluster-entrypoint.sh:473-483` | Pre-creates the `/run/flannel` dir with a comment acknowledging the race condition |
| `deploy/docker/cluster-entrypoint.sh:506` | `exec /bin/k3s server ...` — starts K3s |
| `deploy/docker/Dockerfile.cluster:271-272` | Docker HEALTHCHECK with `--start-period=20s` |
| `deploy/docker/cluster-healthcheck.sh:35-51` | Checks K8s API + StatefulSet readiness |
| `crates/openshell-bootstrap/src/runtime.rs:44-226` | `wait_for_gateway_ready`: 180 attempts × 2s = 360s budget |

Current Behavior

  1. K3s starts; the kubelet schedules pods immediately
  2. Flannel is not ready → all pods fail with missing subnet.env → CrashLoopBackOff
  3. The liveness probe kills the openshell pod at ~17s, just before flannel becomes ready at ~19s
  4. Flannel becomes ready at ~19s, but the pod has already been restarted
  5. The pod re-enters CrashLoopBackOff, and exponential backoff delays each retry further
  6. The server logs "TLS handshake failed" for every tcpSocket probe connection (noise)
  7. The cluster may eventually recover within the 360s CLI timeout, but startup is unreliable

What Would Need to Change

Fix 1 — Add startupProbe to StatefulSet (primary fix):
The Helm chart template needs a startupProbe added to the container spec. This is the standard Kubernetes pattern for pods with slow startup. With a startupProbe, the livenessProbe is disabled until the startup probe succeeds, preventing premature kills during the flannel initialization window.
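A minimal sketch of Fix 1, assuming the same tcpSocket/grpc pattern as the existing probes (the timing values are the proposal's suggestions, not final):

```yaml
# Hypothetical startupProbe for the openshell container spec.
# While this probe has not yet succeeded, Kubernetes suspends the
# liveness probe, so the pod survives the ~19s flannel window.
# Startup budget = periodSeconds * failureThreshold = 2 * 30 = 60s.
startupProbe:
  tcpSocket:
    port: grpc
  periodSeconds: 2
  failureThreshold: 30
```

Once the startup probe succeeds, the existing liveness probe takes over unchanged, so steady-state failure detection stays fast.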

Fix 2 — Downgrade TLS handshake error log level:
In crates/openshell-server/src/lib.rs:201, the error! macro should be changed to debug! (or conditionally downgraded) when the TLS error indicates an immediate EOF — which is the tcpSocket probe pattern. This reduces misleading log noise.
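A sketch of the classification logic for Fix 2 (the helper name is hypothetical; the real accept loop at lib.rs:180-211 would branch between `error!` and `debug!` on this predicate):

```rust
use std::io::{Error, ErrorKind};

/// Heuristic: a tcpSocket probe connects and immediately disconnects,
/// which surfaces during the TLS handshake as an unexpected EOF before
/// any application bytes arrive. Genuine TLS configuration failures
/// (bad certificate, protocol mismatch) carry a different error kind.
fn is_probe_disconnect(e: &Error) -> bool {
    e.kind() == ErrorKind::UnexpectedEof
}

fn main() {
    let probe_err = Error::new(ErrorKind::UnexpectedEof, "tls handshake eof");
    let real_err = Error::new(ErrorKind::InvalidData, "bad certificate");

    // In the accept loop this predicate would select the log level:
    // debug! for probe disconnects, error! for everything else.
    assert!(is_probe_disconnect(&probe_err));
    assert!(!is_probe_disconnect(&real_err));
    println!("probe disconnect classified correctly");
}
```

This keeps real TLS failures visible at the error level while silencing the per-probe noise, matching the "EOF with zero application bytes" signal noted under Risks & Open Questions.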

Alternative Approaches Considered

| Approach | Trade-off |
| --- | --- |
| Increase `initialDelaySeconds` on the liveness probe | Delays detection of real failures; doesn't decouple startup from steady-state |
| Wait for flannel in the entrypoint before exec'ing k3s | Impossible — flannel runs inside k3s |
| Use `--flannel-backend=none` + external flannel | High complexity, maintenance burden |
| Switch to a gRPC health probe | More correct but requires implementing the gRPC health service |
| Add an exec probe with `curl -k https://localhost:8080/healthz` | Validates the full stack but requires curl in the container |

Patterns to Follow

The existing probe definitions in statefulset.yaml:113-126 follow the standard Helm chart pattern with values from values.yaml. The startupProbe should follow the same pattern, using tcpSocket on the grpc port with configurable values.
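Following that pattern, the new probe's timings could be exposed as chart values alongside the existing liveness/readiness blocks (the key names below are illustrative, not the chart's actual schema):

```yaml
# values.yaml -- hypothetical keys mirroring the existing probe values,
# so slower environments (CI, constrained machines) can widen the window.
startupProbe:
  enabled: true
  periodSeconds: 2
  failureThreshold: 30
```

Making `failureThreshold` configurable also answers the open question about slower environments without changing the template again.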

Proposed Approach

Add a startupProbe to the StatefulSet template with generous timing (e.g., periodSeconds: 2, failureThreshold: 30 = 60s startup window). This decouples slow-start tolerance from the steady-state liveness check. Additionally, downgrade the TLS handshake error log from error! to debug! when the error is an immediate EOF, reducing misleading noise. Both changes are low-risk and follow established Kubernetes and Rust logging patterns.

Scope Assessment

  • Complexity: Low
  • Confidence: High — clear path forward, standard Kubernetes pattern
  • Estimated files to change: 3-4 (statefulset.yaml, values.yaml, lib.rs, possibly cluster-healthcheck.sh)
  • Issue type: fix

Risks & Open Questions

  • startupProbe failureThreshold: How long should the startup window be? 60 seconds covers flannel (19s) with generous margin, but slower environments (CI, constrained machines) may need more. Should this be configurable via Helm values?
  • TLS log downgrade scope: Need to ensure we only downgrade log level for the "immediate EOF" pattern (probe behavior) and not for genuine TLS configuration errors. An EOF with zero application bytes is the signal.
  • K8s version requirement: startupProbe is GA since K8s 1.20. K3s v1.35.2 in the logs supports this.

Test Considerations

  • Manual testing: Run openshell sandbox create -- claude on macOS with OrbStack and verify the pod starts without CrashLoopBackOff
  • E2E tests: The existing mise run e2e suite should cover cluster startup; verify it passes with the new probe configuration
  • Unit tests: The TLS log level change can be verified by checking that a tcpSocket-style connect+disconnect produces a debug log, not an error
  • Edge case: Test on a resource-constrained environment where flannel takes longer than usual to initialize

Created by spike investigation. Use build-from-issue to plan and implement.
