Conversation
- Introduced default workspace volume size configuration in StorageConfig. - Implemented endpoints for retrieving sandbox logs and establishing interactive terminal sessions via WebSocket. - Enhanced Kubernetes sandbox service to support pausing and resuming workloads, along with PVC management during these operations. - Updated middleware to allow WebSocket connections without HTTP auth and request ID processing.
- Created a comprehensive runbook detailing the deployment process for SuperSandbox on Kubernetes. - Included prerequisites, installation steps for gVisor and agent-sandbox-controller, and configuration guidelines. - Provided verification steps and API quick reference for managing sandboxes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d720925f99
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| async def __call__(self, scope, receive, send): | ||
| """Pass WebSocket connections through without HTTP auth.""" | ||
| if scope["type"] == "websocket": | ||
| await self.app(scope, receive, send) | ||
| return |
There was a problem hiding this comment.
Enforce API-key checks on WebSocket terminal sessions
When server.api_key is configured, this early return exempts every WebSocket handshake from AuthMiddleware, and the new sandbox_terminal() route in server/src/api/lifecycle.py does not perform any manual header validation before opening a shell. That leaves /sandboxes/{id}/terminal reachable without OPEN-SANDBOX-API-KEY, which bypasses the service's only auth control for the most privileged sandbox operation.
Useful? React with 👍 / 👎.
| raise NotImplementedError("Pause is not supported for BatchSandbox provider") | ||
|
|
||
| def resume_workload(self, sandbox_id: str, namespace: str) -> None: | ||
| """Resume is not supported for BatchSandbox provider.""" | ||
| raise NotImplementedError("Resume is not supported for BatchSandbox provider") |
There was a problem hiding this comment.
Avoid turning default BatchSandbox pause/resume into 500s
On Kubernetes setups that use the default provider selection (create_workload_provider() falls back to batchsandbox), both pause and resume now raise NotImplementedError here. KubernetesSandboxService.pause_sandbox() and resume_sandbox() catch that as a generic exception and translate it to HTTP 500, so the existing pause/resume endpoints regress from a clear 'unsupported' response to an internal-server-error for the default provider.
Useful? React with 👍 / 👎.
| self.k8s_client.create_pvc( | ||
| namespace=namespace, | ||
| name=pvc_name, | ||
| storage_size=self._workspace_volume_size, | ||
| labels={SANDBOX_ID_LABEL: sandbox_id}, |
There was a problem hiding this comment.
Tie workspace PVC cleanup to the Sandbox lifecycle
This workspace claim is created as a standalone PVC and only receives a label; nothing in this flow adds an ownerReference after the Sandbox CR is created. As a result, agent-sandbox workloads that expire via shutdownPolicy=Delete or are removed out of band will leak their PVC permanently, because delete_workload() is never called to run the manual cleanup path.
Useful? React with 👍 / 👎.
| raise NotImplementedError("Logs are not yet implemented for Docker runtime") | ||
|
|
||
| def exec_sandbox_terminal(self, sandbox_id: str, command: str = "/bin/bash"): | ||
| """Open terminal to a Docker sandbox container.""" | ||
| raise NotImplementedError("Terminal is not yet implemented for Docker runtime") |
There was a problem hiding this comment.
Return a supported error for Docker logs and terminal
The new /sandboxes/{id}/logs and /sandboxes/{id}/terminal routes are registered unconditionally in server/src/api/lifecycle.py, but the Docker backend now throws raw NotImplementedErrors here. On a Docker deployment, the log endpoint becomes an HTTP 500 and the terminal socket closes with code 1011, which looks like a server failure instead of an unsupported capability.
Useful? React with 👍 / 👎.
Summary
Testing
Breaking Changes
Checklist