-
Notifications
You must be signed in to change notification settings - Fork 321
Description
Agent Diagnostic
Investigated the deploy flow in crates/openshell-bootstrap/src/lib.rs (deploy_gateway_with_logs(), lines 253-516). The function creates Docker resources incrementally — network, volume, container — then proceeds through container start, PKI reconciliation, and health check. Every step after ensure_volume() (line 339) uses ? to propagate errors, but no cleanup guard exists. If any step after volume creation fails, the volume openshell-cluster-{name} is left behind in a partially-initialized state.
destroy_gateway_resources() in crates/openshell-bootstrap/src/docker.rs:904-911 correctly removes the volume (via docker.remove_volume()), and openshell gateway destroy calls it. However, nothing invokes this cleanup automatically on a failed deploy.
The error diagnosis in crates/openshell-bootstrap/src/errors.rs:180-191 tells users to manually run openshell gateway destroy and then docker volume rm, but this is guidance only — not automatic.
Downstream consumers (e.g., NemoClaw PR #337) are working around this by shelling out to docker volume rm in their onboarding scripts. This should be fixed at the source in OpenShell.
Description
Actual behavior: When openshell gateway start fails after the Docker volume has been created (e.g., container start failure, PKI error, health check timeout), the volume openshell-cluster-{name} is left behind. Subsequent openshell gateway start attempts detect the orphaned volume as an existing gateway ("volume only" state) and either fail with "Corrupted cluster state" or prompt to recreate, even though the original deploy never completed.
Expected behavior: On deploy failure, OpenShell should automatically clean up the Docker volume, container, network, and image so the environment is left in a clean, retryable state. If automatic cleanup fails, the error message should include the exact manual recovery command.
Reproduction Steps
- Trigger a gateway deploy failure (e.g., use a port already in use, or kill Docker mid-deploy)
- Run
openshell gateway startagain - Observe that it detects the orphaned volume and fails or prompts unnecessarily
Environment
- All platforms (local Docker, Colima, Docker Desktop)
- All OpenShell versions with the current bootstrap crate
Fix
Add a cleanup guard in deploy_gateway_with_logs() that calls destroy_gateway_resources() when any step after resource creation fails, before propagating the error.