Merged
2 changes: 1 addition & 1 deletion .agents/skills/debug-openshell-cluster/SKILL.md
@@ -160,7 +160,7 @@ openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-op
Common issues:

- **Replicas 0/0**: The StatefulSet has been scaled to zero — no pods are running. This can happen after a failed deploy, manual scale-down, or Helm values misconfiguration. Fix: `openshell doctor exec -- kubectl -n openshell scale statefulset openshell --replicas=1`
- **ImagePullBackOff**: The component image failed to pull. In `internal` mode, verify internal registry readiness and pushed image tags (Step 5). In `external` mode, check `/etc/rancher/k3s/registries.yaml` credentials/endpoints and DNS (Step 8). Default external registry is `ghcr.io/nvidia/openshell/`. Ensure a valid `--registry-token` (or `OPENSHELL_REGISTRY_TOKEN`) was provided during deploy.
- **ImagePullBackOff**: The component image failed to pull. In `internal` mode, verify internal registry readiness and pushed image tags (Step 5). In `external` mode, check `/etc/rancher/k3s/registries.yaml` credentials/endpoints and DNS (Step 8). Default external registry is `ghcr.io/nvidia/openshell/` (public, no auth required). If using a private registry, ensure `--registry-username` and `--registry-token` (or `OPENSHELL_REGISTRY_USERNAME`/`OPENSHELL_REGISTRY_TOKEN`) were provided during deploy.
- **CrashLoopBackOff**: The server is crashing. Check pod logs for the actual error.
- **Pending**: Insufficient resources or scheduling constraints.

19 changes: 2 additions & 17 deletions README.md
@@ -21,25 +21,10 @@ Want to run on cloud compute? [Launch on Brev](https://brev.nvidia.com/launchabl

### Install

**Binary (recommended — requires [GitHub CLI](https://cli.github.com)):**
**Binary (recommended):**

```bash
sh -c 'ARCH=$(uname -m); OS=$(uname -s); \
case "${OS}-${ARCH}" in \
Linux-x86_64) ASSET="openshell-x86_64-unknown-linux-musl.tar.gz" ;; \
Linux-aarch64) ASSET="openshell-aarch64-unknown-linux-musl.tar.gz" ;; \
Darwin-arm64) ASSET="openshell-aarch64-apple-darwin.tar.gz" ;; \
*) echo "Unsupported platform: ${OS}-${ARCH}" >&2; exit 1 ;; \
esac; \
gh release download devel --repo NVIDIA/OpenShell --pattern "${ASSET}" -O - \
| tar xz \
&& sudo install -m 755 openshell /usr/local/bin/openshell'
```

Or use the install script from the repository:

```bash
./install.sh
curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh
```

**From PyPI (requires [uv](https://docs.astral.sh/uv/)):**
6 changes: 3 additions & 3 deletions architecture/gateway-single-node.md
@@ -264,7 +264,7 @@ Falls back to `8.8.8.8` / `8.8.4.4` if iptables detection fails.

### Registry configuration

Writes `/etc/rancher/k3s/registries.yaml` from `REGISTRY_HOST`, `REGISTRY_ENDPOINT`, `REGISTRY_USERNAME`, `REGISTRY_PASSWORD`, and `REGISTRY_INSECURE` environment variables so that k3s/containerd can authenticate when pulling component images at runtime.
Writes `/etc/rancher/k3s/registries.yaml` from `REGISTRY_HOST`, `REGISTRY_ENDPOINT`, `REGISTRY_USERNAME`, `REGISTRY_PASSWORD`, and `REGISTRY_INSECURE` environment variables so that k3s/containerd can authenticate when pulling component images at runtime. When no explicit credentials are provided (the default for public GHCR repos), the auth block is omitted and images are pulled anonymously.
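
The conditional auth block described above can be illustrated with a minimal sketch. It is written in Rust only to match the rest of this change; the real entrypoint is not shown in this diff and may be implemented differently. `RegistryConfig` and `render_registries_yaml` are hypothetical names, the layout follows the usual k3s `registries.yaml` shape (`mirrors` / `configs`), and values are assumed not to need YAML escaping.

```rust
/// Hypothetical holder for values read from the REGISTRY_* variables.
pub struct RegistryConfig<'a> {
    pub host: &'a str,
    pub endpoint: Option<&'a str>,
    pub username: Option<&'a str>,
    pub password: Option<&'a str>,
    pub insecure: bool,
}

/// Render a k3s-style registries.yaml as a string.
/// The `auth` block is emitted only when a non-empty password is present;
/// without it, containerd pulls anonymously.
pub fn render_registries_yaml(cfg: &RegistryConfig) -> String {
    let mut out = String::new();

    // Optional mirror endpoint (assumed to map to REGISTRY_ENDPOINT).
    if let Some(endpoint) = cfg.endpoint {
        out.push_str("mirrors:\n");
        out.push_str(&format!(
            "  \"{}\":\n    endpoint:\n      - \"{}\"\n",
            cfg.host, endpoint
        ));
    }

    let password = cfg.password.filter(|p| !p.is_empty());
    if password.is_none() && !cfg.insecure {
        // Nothing to configure for this host: no auth, default TLS.
        return out;
    }

    out.push_str("configs:\n");
    out.push_str(&format!("  \"{}\":\n", cfg.host));

    if let Some(password) = password {
        // Fall back to `__token__` when only a token was supplied.
        let username = cfg
            .username
            .filter(|u| !u.is_empty())
            .unwrap_or("__token__");
        out.push_str("    auth:\n");
        out.push_str(&format!("      username: \"{username}\"\n"));
        out.push_str(&format!("      password: \"{password}\"\n"));
    }
    if cfg.insecure {
        out.push_str("    tls:\n      insecure_skip_verify: true\n");
    }
    out
}
```

With no credentials and `insecure: false`, the sketch emits no `configs` entry at all, which corresponds to the anonymous-pull default for public GHCR repos.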

### Manifest injection

@@ -392,8 +392,8 @@ Variables set on the container by `ensure_container()` in `docker.rs`:
| `REGISTRY_INSECURE` | `"true"` or `"false"` | Always |
| `IMAGE_REPO_BASE` | `{registry_host}/{namespace}` (or `IMAGE_REPO_BASE`/`OPENSHELL_IMAGE_REPO_BASE` override) | Always |
| `REGISTRY_ENDPOINT` | Custom endpoint URL | When `OPENSHELL_REGISTRY_ENDPOINT` is set |
| `REGISTRY_USERNAME` | Registry auth username | When credentials available |
| `REGISTRY_PASSWORD` | Registry auth password | When credentials available |
| `REGISTRY_USERNAME` | Registry auth username | When explicit credentials provided via `--registry-username`/`--registry-token` or env vars |
| `REGISTRY_PASSWORD` | Registry auth password | When explicit credentials provided via `--registry-username`/`--registry-token` or env vars |
| `EXTRA_SANS` | Comma-separated extra TLS SANs | When extra SANs computed |
| `SSH_GATEWAY_HOST` | Resolved remote hostname/IP | Remote deploys only |
| `SSH_GATEWAY_PORT` | Configured host port (default `8080`) | Remote deploys only |
46 changes: 17 additions & 29 deletions crates/openshell-bootstrap/src/docker.rs
@@ -3,9 +3,7 @@

use crate::RemoteOptions;
use crate::constants::{container_name, network_name, volume_name};
use crate::image::{
self, DEFAULT_IMAGE_REPO_BASE, DEFAULT_REGISTRY, DEFAULT_REGISTRY_USERNAME, parse_image_ref,
};
use crate::image::{self, DEFAULT_IMAGE_REPO_BASE, DEFAULT_REGISTRY, parse_image_ref};
use bollard::API_DEFAULT_VERSION;
use bollard::Docker;
use bollard::errors::Error as BollardError;
@@ -403,6 +401,7 @@ pub async fn ensure_volume(docker: &Docker, name: &str) -> Result<()> {
pub async fn ensure_image(
docker: &Docker,
image_ref: &str,
registry_username: Option<&str>,
registry_token: Option<&str>,
) -> Result<()> {
match docker.inspect_image(image_ref).await {
@@ -423,9 +422,10 @@ pub async fn ensure_image(

let (repo, tag) = parse_image_ref(image_ref);

// Use GHCR credentials (explicit or built-in default) for ghcr.io images.
// Use explicit GHCR credentials when provided for ghcr.io images.
// Public repos are pulled without authentication by default.
let credentials = if repo.starts_with("ghcr.io/") {
image::ghcr_credentials(registry_token)
image::ghcr_credentials(registry_username, registry_token)
} else {
None
};
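
Reviewer sketch (not part of the change): the gating above, where credentials are attached only for `ghcr.io/` images and only when a token is supplied, can be tried in isolation. `PullCredentials` is a hypothetical stand-in for bollard's `DockerCredentials`, and `credentials_for_image` is an illustrative name that does not appear in the crate.

```rust
/// Hypothetical stand-in for bollard's `DockerCredentials`,
/// reduced to the three fields this change populates.
#[derive(Debug, PartialEq)]
pub struct PullCredentials {
    pub username: String,
    pub password: String,
    pub serveraddress: String,
}

/// Decide which credentials (if any) to attach to an image pull.
/// Returns `None` for non-GHCR images and for GHCR images without a token,
/// so public repositories are pulled anonymously.
pub fn credentials_for_image(
    repo: &str,
    username: Option<&str>,
    token: Option<&str>,
) -> Option<PullCredentials> {
    if !repo.starts_with("ghcr.io/") {
        return None;
    }
    // An empty token is treated the same as no token.
    let token = token.filter(|t| !t.is_empty())?;
    // A username alone is never enough; it only matters once a token exists.
    let username = username.filter(|u| !u.is_empty()).unwrap_or("__token__");
    Some(PullCredentials {
        username: username.to_string(),
        password: token.to_string(),
        serveraddress: "ghcr.io".to_string(),
    })
}
```

For example, `credentials_for_image("ghcr.io/nvidia/openshell/cluster", None, None)` yields `None`, while passing `Some("ghp_test123")` as the token yields credentials with the `__token__` username.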
@@ -452,6 +452,7 @@ pub async fn ensure_container(
gateway_port: u16,
disable_tls: bool,
disable_gateway_auth: bool,
registry_username: Option<&str>,
registry_token: Option<&str>,
gpu: bool,
) -> Result<()> {
@@ -586,15 +587,17 @@ pub async fn ensure_container(

// Credential priority:
// 1. OPENSHELL_REGISTRY_USERNAME/PASSWORD env vars (power-user override)
// 2. registry_token from --registry-token / OPENSHELL_REGISTRY_TOKEN
// 3. Built-in default XOR-decoded token
let registry_username = env_non_empty("OPENSHELL_REGISTRY_USERNAME")
.or_else(|| Some(DEFAULT_REGISTRY_USERNAME.to_string()));
let registry_password = env_non_empty("OPENSHELL_REGISTRY_PASSWORD").or_else(|| {
// 2. registry_username/registry_token from CLI flags / env vars
// No built-in default — GHCR repos are public and pull without auth.
let effective_username = env_non_empty("OPENSHELL_REGISTRY_USERNAME").or_else(|| {
registry_username
.filter(|u| !u.is_empty())
.map(ToString::to_string)
});
let effective_password = env_non_empty("OPENSHELL_REGISTRY_PASSWORD").or_else(|| {
registry_token
.filter(|t| !t.is_empty())
.map(ToString::to_string)
.or_else(|| Some(image::default_registry_token()))
});

let mut env_vars: Vec<String> = vec![
@@ -606,28 +609,13 @@ pub async fn ensure_container(
if let Some(endpoint) = registry_endpoint {
env_vars.push(format!("REGISTRY_ENDPOINT={endpoint}"));
}
if let (Some(username), Some(password)) = (registry_username, registry_password) {
if let Some(password) = effective_password {
// Default to __token__ when only a password/token is provided.
let username = effective_username.unwrap_or_else(|| "__token__".to_string());
env_vars.push(format!("REGISTRY_USERNAME={username}"));
env_vars.push(format!("REGISTRY_PASSWORD={password}"));
}

// When the primary registry is NOT ghcr.io (e.g. a local registry in
// push-mode), we still need containerd credentials for the community
// registry so that community sandbox images
// (ghcr.io/nvidia/openshell-community/sandboxes/*) can be pulled at
// runtime. Pass community registry credentials as a separate set of
// env vars so the entrypoint can add a second block to registries.yaml.
if registry_host != DEFAULT_REGISTRY {
env_vars.push(format!("COMMUNITY_REGISTRY_HOST={DEFAULT_REGISTRY}"));
env_vars.push(format!(
"COMMUNITY_REGISTRY_USERNAME={DEFAULT_REGISTRY_USERNAME}"
));
env_vars.push(format!(
"COMMUNITY_REGISTRY_PASSWORD={}",
image::default_registry_token()
));
}

if !extra_sans.is_empty() {
env_vars.push(format!("EXTRA_SANS={}", extra_sans.join(",")));
}
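
Reviewer sketch (not part of the change): the credential priority in this hunk can be exercised without Docker by passing the environment lookups in as arguments. `resolve_registry_env` and `non_empty` are hypothetical names; the `env_*` parameters stand in for `env_non_empty("OPENSHELL_REGISTRY_USERNAME")` and `env_non_empty("OPENSHELL_REGISTRY_PASSWORD")`.

```rust
/// Treat empty strings as absent, mirroring the `.filter(|s| !s.is_empty())`
/// calls in the diff.
fn non_empty(value: Option<&str>) -> Option<String> {
    value.filter(|v| !v.is_empty()).map(ToString::to_string)
}

/// Build the REGISTRY_USERNAME / REGISTRY_PASSWORD container env entries.
/// Priority: environment overrides first, then CLI flag values.
/// With no password or token from either source, nothing is emitted.
pub fn resolve_registry_env(
    env_username: Option<&str>,
    env_password: Option<&str>,
    flag_username: Option<&str>,
    flag_token: Option<&str>,
) -> Vec<String> {
    let username = non_empty(env_username).or_else(|| non_empty(flag_username));
    let password = non_empty(env_password).or_else(|| non_empty(flag_token));

    let mut env_vars = Vec::new();
    if let Some(password) = password {
        // Default to `__token__` when only a password/token is provided.
        let username = username.unwrap_or_else(|| "__token__".to_string());
        env_vars.push(format!("REGISTRY_USERNAME={username}"));
        env_vars.push(format!("REGISTRY_PASSWORD={password}"));
    }
    env_vars
}
```

One consequence worth noting in review: a username without a password or token produces no entries, so a lone `--registry-username` has no effect on the container environment.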
7 changes: 5 additions & 2 deletions crates/openshell-bootstrap/src/errors.rs
@@ -241,15 +241,18 @@ fn diagnose_image_pull_auth_failure(_gateway_name: &str) -> GatewayFailureDiagno
GatewayFailureDiagnosis {
summary: "Registry authentication failed".to_string(),
explanation: "Could not authenticate with the container registry. The image may not \
exist, or you may not have permission to access it."
exist, or you may not have permission to access it. Public GHCR repos \
should not require authentication — if you see this error with the default \
registry, it may indicate the image does not exist or a network issue."
.to_string(),
recovery_steps: vec![
RecoveryStep::with_command(
"Verify the image exists and you have access",
"docker pull ghcr.io/nvidia/openshell/cluster:latest",
),
RecoveryStep::new(
"If using a private registry, ensure OPENSHELL_REGISTRY_TOKEN is set",
"If using a private registry, set OPENSHELL_REGISTRY_USERNAME and OPENSHELL_REGISTRY_TOKEN \
(or use --registry-username and --registry-token)",
),
RecoveryStep::with_command("Check your Docker login", "docker login ghcr.io"),
],
100 changes: 30 additions & 70 deletions crates/openshell-bootstrap/src/image.rs
@@ -42,42 +42,7 @@ pub const DEFAULT_GATEWAY_IMAGE: &str = "ghcr.io/nvidia/openshell/cluster";
///
/// GHCR accepts any non-empty username when authenticating with a PAT;
/// `__token__` is a common convention for token-based OCI registry auth.
pub const DEFAULT_REGISTRY_USERNAME: &str = "__token__";

// ---------------------------------------------------------------------------
// XOR-obfuscated default registry token
// ---------------------------------------------------------------------------
// A read-only GHCR PAT is XOR-encoded so it doesn't appear as plaintext in
// the compiled binary. This is a lightweight deterrent against casual
// inspection — it is NOT a security boundary. The `--registry-token` flag
// (or `OPENSHELL_REGISTRY_TOKEN` env var) overrides this default.

/// XOR key used to decode the default registry token.
const XOR_KEY: [u8; 32] = [
0x9c, 0x87, 0xc1, 0x0c, 0x00, 0xe2, 0x59, 0x14, 0x98, 0xb8, 0xa5, 0x45, 0x48, 0x40, 0x3e, 0x92,
0x62, 0x41, 0xfe, 0x5e, 0xd4, 0x09, 0x23, 0xe6, 0x85, 0xa7, 0x94, 0xab, 0xb8, 0x15, 0xcd, 0x45,
];

/// XOR-encoded default GHCR registry token.
const DEFAULT_REGISTRY_TOKEN_ENC: [u8; 40] = [
0xfb, 0xef, 0xb1, 0x53, 0x44, 0xb4, 0x6d, 0x71, 0xd0, 0xf0, 0xd1, 0x15, 0x09, 0x39, 0x72, 0xd7,
0x29, 0x36, 0xb7, 0x69, 0xe5, 0x64, 0x55, 0xaf, 0xee, 0xd2, 0xc0, 0xd2, 0xd1, 0x5b, 0x81, 0x0e,
0xd1, 0xf5, 0xf2, 0x5a, 0x6b, 0xa3, 0x14, 0x46,
];

/// Decode an XOR-encoded byte slice using [`XOR_KEY`].
fn xor_decode(encoded: &[u8]) -> String {
encoded
.iter()
.enumerate()
.map(|(i, b)| (b ^ XOR_KEY[i % XOR_KEY.len()]) as char)
.collect()
}

/// Default GHCR registry token, decoded at runtime.
pub(crate) fn default_registry_token() -> String {
xor_decode(&DEFAULT_REGISTRY_TOKEN_ENC)
}
const DEFAULT_REGISTRY_USERNAME: &str = "__token__";

/// Parse an image reference into (repository, tag).
///
@@ -150,18 +115,22 @@ pub async fn pull_image(
Ok(())
}

/// Build [`DockerCredentials`] for ghcr.io from a registry token.
/// Build [`DockerCredentials`] for ghcr.io from explicit credentials.
///
/// When `token` is `None` or empty, falls back to the built-in default
/// token (XOR-decoded at runtime). Always returns `Some`.
#[allow(clippy::unnecessary_wraps)]
pub(crate) fn ghcr_credentials(token: Option<&str>) -> Option<DockerCredentials> {
let effective_token = token
.filter(|t| !t.is_empty())
.map_or_else(default_registry_token, ToString::to_string);
/// Returns `None` when `token` is `None` or empty — the default GHCR repos
/// are public and do not require authentication. When a token is provided,
/// uses the given `username` (falling back to `__token__` if `None`/empty).
pub(crate) fn ghcr_credentials(
username: Option<&str>,
token: Option<&str>,
) -> Option<DockerCredentials> {
let token = token.filter(|t| !t.is_empty())?;
let username = username
.filter(|u| !u.is_empty())
.unwrap_or(DEFAULT_REGISTRY_USERNAME);
Some(DockerCredentials {
username: Some(DEFAULT_REGISTRY_USERNAME.to_string()),
password: Some(effective_token),
username: Some(username.to_string()),
password: Some(token.to_string()),
serveraddress: Some(DEFAULT_REGISTRY.to_string()),
..Default::default()
})
@@ -182,6 +151,7 @@ pub(crate) fn ghcr_credentials(token: Option<&str>) -> Option<DockerCredentials>
pub async fn pull_remote_image(
remote: &Docker,
image_ref: &str,
registry_username: Option<&str>,
registry_token: Option<&str>,
mut on_progress: impl FnMut(String) + Send + 'static,
) -> Result<()> {
@@ -213,7 +183,7 @@ pub async fn pull_remote_image(
);
on_progress(format!("[progress] Pulling {platform_str} image"));

let credentials = ghcr_credentials(registry_token);
let credentials = ghcr_credentials(registry_username, registry_token);

let options = CreateImageOptions {
from_image: Some(registry_image_base),
@@ -351,8 +321,8 @@ mod tests {
}

#[test]
fn ghcr_credentials_with_token() {
let creds = ghcr_credentials(Some("ghp_test123"));
fn ghcr_credentials_with_token_default_username() {
let creds = ghcr_credentials(None, Some("ghp_test123"));
assert!(creds.is_some());
let creds = creds.unwrap();
assert_eq!(creds.username.as_deref(), Some("__token__"));
@@ -361,31 +331,21 @@ mod tests {
}

#[test]
fn ghcr_credentials_without_token_uses_default() {
// When no explicit token is provided, the built-in default is used.
let creds = ghcr_credentials(None).unwrap();
assert_eq!(creds.username.as_deref(), Some("__token__"));
fn ghcr_credentials_with_custom_username() {
let creds = ghcr_credentials(Some("myuser"), Some("ghp_test123"));
assert!(creds.is_some());
let creds = creds.unwrap();
assert_eq!(creds.username.as_deref(), Some("myuser"));
assert_eq!(creds.password.as_deref(), Some("ghp_test123"));
assert_eq!(creds.serveraddress.as_deref(), Some("ghcr.io"));
// The password should be the decoded default token (non-empty).
assert!(creds.password.is_some());
assert!(!creds.password.as_ref().unwrap().is_empty());

// Same for empty string.
let creds2 = ghcr_credentials(Some("")).unwrap();
assert_eq!(creds2.password, creds.password);
}

#[test]
fn xor_decode_default_token() {
let token = default_registry_token();
assert!(
!token.is_empty(),
"default token should decode to non-empty"
);
assert!(
token.chars().all(|c| c.is_ascii_graphic()),
"default token should be printable ASCII"
);
fn ghcr_credentials_without_token_returns_none() {
// No token means unauthenticated (public repos).
assert!(ghcr_credentials(None, None).is_none());
assert!(ghcr_credentials(None, Some("")).is_none());
assert!(ghcr_credentials(Some("myuser"), None).is_none());
}

#[test]