Description
Agent Diagnostic
Issue: openshell sandbox create --from fails with timeout when the resulting image is large (e.g., 2.5 GB).
Root Cause
The upload timeout is fixed at 120 seconds — the bollard library's default for Docker::connect_with_local_defaults(). This timeout applies to the entire HTTP PUT
request that uploads the image tar into the gateway container.
For a 2.5 GB image, even over a local Unix socket, the gateway container needs to receive the data and write it to its overlay filesystem. On a loaded system or slow
storage, 120 seconds is not enough.
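To put the numbers above in perspective, here is a rough back-of-the-envelope sketch of the sustained throughput the gateway must achieve for the upload to finish before the timeout fires. The helper name `required_mib_per_sec` is hypothetical (it is not part of openshell or bollard), and the image size is assumed to be 2.5 decimal GB. With the 120-second default this works out to roughly 20 MiB/s sustained for the whole transfer, including the write to the overlay filesystem.

```rust
use std::time::Duration;

/// Hypothetical helper (not part of openshell or bollard): the minimum
/// sustained throughput, in MiB/s, needed to move `bytes` within `timeout`.
fn required_mib_per_sec(bytes: u64, timeout: Duration) -> f64 {
    let mib = bytes as f64 / (1024.0 * 1024.0);
    mib / timeout.as_secs_f64()
}

fn main() {
    // Assumption: "2.5 GB" in the report means 2.5 * 10^9 bytes.
    let image_bytes: u64 = 2_500_000_000;

    // 120 s is the current default; 600 s is what the SSH client uses;
    // 3600 s is the value proposed in this issue.
    for secs in [120, 600, 3600] {
        let rate = required_mib_per_sec(image_bytes, Duration::from_secs(secs));
        println!("{secs:>5} s timeout -> at least {rate:.1} MiB/s sustained");
    }
}
```

Any stall on a loaded host eats directly into that budget, since the timeout covers the whole request rather than idle time between chunks.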
Two code paths are affected:
- crates/openshell-bootstrap/src/build.rs:50 — sandbox create --from path: builds image locally, pushes into gateway
- crates/openshell-bootstrap/src/lib.rs:472 — OPENSHELL_PUSH_IMAGES dev workflow: pushes pre-built images into gateway
By contrast, the SSH Docker client in docker.rs:250 already uses a 600-second timeout, indicating the upload timeout problem was known for the remote case but the
local case was overlooked.
Proposed Fix
Bollard 0.20 provides Docker::connect_with_defaults(), which auto-detects DOCKER_HOST, and the resulting client supports a .with_timeout() builder. Replace connect_with_local_defaults()
with this at both call sites and set an explicit timeout (3600 seconds in the snippets below). No new helpers are needed.
build.rs:50:
    // Before
    let local_docker = Docker::connect_with_local_defaults()
        .into_diagnostic()
        .wrap_err("failed to connect to local Docker daemon")?;

    // After
    let local_docker = Docker::connect_with_defaults()
        .into_diagnostic()
        .wrap_err("failed to connect to local Docker daemon")?
        .with_timeout(std::time::Duration::from_secs(3600));
lib.rs:472:
    // Before
    let local_docker = Docker::connect_with_local_defaults().into_diagnostic()?;

    // After
    let local_docker = Docker::connect_with_defaults()
        .into_diagnostic()?
        .with_timeout(std::time::Duration::from_secs(3600));
Side benefit: connect_with_defaults() also correctly respects DOCKER_HOST for unix://, tcp://, and http:// schemes, which connect_with_local_defaults() silently
ignores.
Scope: 2 files, ~6 lines changed. No API surface changes. No protocol changes. Backwards compatible.
Description
Original issue reported in NVIDIA/NemoClaw:
NVIDIA/NemoClaw#427
Issue: openshell sandbox create --from fails with timeout when the resulting image is large (e.g., 2.5 GB).
It seems the upload timeout is fixed at 120 seconds, the bollard library's default for Docker::connect_with_local_defaults(). This timeout applies to the entire HTTP PUT request that uploads the image tar into the gateway container.
Reproduction Steps
See the original issue: NVIDIA/NemoClaw#427
Environment
- OS: Ubuntu
- openshell 0.0.10
- nemoclaw 0.1.0
Check more details here: NVIDIA/NemoClaw#427
Logs
Agent-First Checklist
- I pointed my agent at the repo and had it investigate this issue
- I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
- My agent could not resolve this; the diagnostic above explains why