Summary
In production environments, we are observing two critical issues that lead to sandbox startup failures, ambiguous user-facing errors, and poor debuggability:
- client-proxy floods with 502 Reverse proxy errors
  - No upstream health checking
  - No TCP/gRPC connectivity validation before forwarding
  - Returns a generic 502 instead of a meaningful 503 Service Unavailable
  - Causes sandbox process startup to fail intermittently
- orchestration-api returns misleading 404 on snapshot not found
  - Log: snapshot not found
  - User error: Sandbox doesn't exist or you don't have access to it
  - No detailed logging for why the snapshot lookup failed (missing object, path, permissions, cache)
  - Service still registers routes as ready despite being unable to serve snapshots
Together, these issues make sandbox startup failures extremely hard to debug and give users error messages that do not reflect the real cause.
Log Evidence
client-proxy 502
Reverse proxy error {"service": "client-proxy", "shturl.cc/": "im2pa8g5739aezjy57vr1", "target_hostname": "10.254.73.19", "target_port": "5007", "status_code": 502}
orchestration-api snapshot not found
snapshot not found {"service": "orchestration-api", "shturl.cc/": "im2pa8g5739aezjy57vr1"}