openshell sandbox create --from <Dockerfile> fails with timeout when the resulting image is large (e.g., 2.5 GB). #485

@paritoshd-nv

Agent Diagnostic

Issue: openshell sandbox create --from fails with timeout when the resulting image is large (e.g., 2.5 GB).


Root Cause

The upload timeout is fixed at 120 seconds — the bollard library's default for Docker::connect_with_local_defaults(). This timeout applies to the entire HTTP PUT
request that uploads the image tar into the gateway container.

For a 2.5 GB image, even over a local Unix socket, the gateway container needs to receive the data and write it to its overlay filesystem. On a loaded system or slow
storage, 120 seconds is not enough.
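To put the 120-second limit in perspective, here is a rough back-of-the-envelope sketch of the sustained throughput the upload would need. The helper names are hypothetical (they are not part of openshell or bollard), the image size is taken as 2.5 GB decimal, and the 10 MB/s "slow storage" figure is an assumption for illustration only.

```rust
use std::time::Duration;

/// Minimum sustained throughput (bytes/second) needed to move `image_bytes`
/// before a whole-request `timeout` expires. Hypothetical helper.
fn min_throughput_bytes_per_sec(image_bytes: u64, timeout: Duration) -> f64 {
    image_bytes as f64 / timeout.as_secs_f64()
}

/// Time needed to move `image_bytes` at a given sustained throughput.
/// Hypothetical helper.
fn transfer_time(image_bytes: u64, bytes_per_sec: f64) -> Duration {
    Duration::from_secs_f64(image_bytes as f64 / bytes_per_sec)
}

fn main() {
    let image_bytes: u64 = 2_500_000_000; // 2.5 GB (decimal), as in the report
    let timeout = Duration::from_secs(120); // the timeout described above

    let needed = min_throughput_bytes_per_sec(image_bytes, timeout);
    println!("need at least {:.1} MB/s sustained", needed / 1e6);

    // Assumed throughput on a loaded system; not a measured value.
    let assumed_slow = 10_000_000.0;
    println!(
        "at 10 MB/s the upload takes {} s",
        transfer_time(image_bytes, assumed_slow).as_secs()
    );
}
```

Under these assumptions the upload has to sustain roughly 20.8 MB/s end to end (receive plus overlay write) to finish in time; at 10 MB/s it would need 250 seconds, more than twice the limit.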

Two code paths are affected:

  • crates/openshell-bootstrap/src/build.rs:50 — sandbox create --from path: builds image locally, pushes into gateway
  • crates/openshell-bootstrap/src/lib.rs:472 — OPENSHELL_PUSH_IMAGES dev workflow: pushes pre-built images into gateway

By contrast, the SSH Docker client in docker.rs:250 already uses a 600-second timeout, indicating the upload timeout problem was known for the remote case but the
local case was overlooked.


Proposed Fix

Bollard 0.20 provides Docker::connect_with_defaults() which auto-detects DOCKER_HOST and supports a .with_timeout() builder. Replace connect_with_local_defaults()
with this at both callsites. No new helpers needed.

build.rs:50:
// Before
let local_docker = Docker::connect_with_local_defaults()
    .into_diagnostic()
    .wrap_err("failed to connect to local Docker daemon")?;

// After
let local_docker = Docker::connect_with_defaults()
    .into_diagnostic()
    .wrap_err("failed to connect to local Docker daemon")?
    .with_timeout(std::time::Duration::from_secs(3600));

lib.rs:472:
// Before
let local_docker = Docker::connect_with_local_defaults().into_diagnostic()?;

// After
let local_docker = Docker::connect_with_defaults()
    .into_diagnostic()?
    .with_timeout(std::time::Duration::from_secs(3600));

Side benefit: connect_with_defaults() also correctly respects DOCKER_HOST for unix://, tcp://, and http:// schemes, which connect_with_local_defaults() silently
ignores.
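To make the DOCKER_HOST point concrete, the sketch below classifies a DOCKER_HOST value by the three schemes listed above. It is illustrative only: `Transport` and `classify_docker_host` are hypothetical names, and the code does not reproduce bollard's internal detection logic.

```rust
/// Transport implied by a DOCKER_HOST value (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
enum Transport {
    Unix(String),
    Tcp(String),
    Http(String),
    Unsupported(String),
}

/// Classify a DOCKER_HOST value by its URL scheme. Hypothetical helper.
fn classify_docker_host(value: &str) -> Transport {
    if let Some(path) = value.strip_prefix("unix://") {
        Transport::Unix(path.to_string())
    } else if let Some(addr) = value.strip_prefix("tcp://") {
        Transport::Tcp(addr.to_string())
    } else if let Some(addr) = value.strip_prefix("http://") {
        Transport::Http(addr.to_string())
    } else {
        Transport::Unsupported(value.to_string())
    }
}

fn main() {
    // Example values only; the addresses are placeholders.
    let samples = [
        "unix:///var/run/docker.sock",
        "tcp://127.0.0.1:2375",
        "http://127.0.0.1:2375",
    ];
    for value in samples {
        println!("{value} -> {:?}", classify_docker_host(value));
    }
}
```

A client that only ever opens the default local socket would behave as if every value fell into a single branch, which is the behavior this issue attributes to connect_with_local_defaults().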

Scope: 2 files, ~6 lines changed. No API surface changes. No protocol changes. Backwards compatible.

Description

Original issue reported in NVIDIA/NemoClaw:
NVIDIA/NemoClaw#427

Issue: openshell sandbox create --from fails with timeout when the resulting image is large (e.g., 2.5 GB).

The upload timeout appears to be fixed at 120 seconds, the bollard library's default for Docker::connect_with_local_defaults(). This timeout applies to the entire HTTP PUT request that uploads the image tar into the gateway container.

Reproduction Steps

See the original issue: NVIDIA/NemoClaw#427

Environment

  • OS: Ubuntu
  • openshell 0.0.10
  • nemoclaw 0.1.0

More details are in the original issue: NVIDIA/NemoClaw#427

Logs

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
