Issue workload identities on distributed runners#846
Conversation
The workload identity issuer holds the cluster's Ed25519 signing key and was only wired into the coordinator's embedded runner. Distributed runners built their sandbox controller with a nil issuer, so their sandboxes never received an identity token, the MIREN_* env vars, token refresh, or the token request server. Expose token minting as a coordinator RPC that runners call: - Add WorkloadIssuerInfo and IssueWorkloadToken to the RunnerRegistration service. The coordinator (which holds the signing key) mints tokens on behalf of runners. - Introduce a workloadidentity.TokenIssuer interface. The concrete *Issuer satisfies it on the coordinator; distributed runners use a remoteIssuer that proxies minting over RPC. A background loop keeps the issuer URL in sync (re-registration, coordinator restart) and logs only on transitions. - Guard against boxing a typed-nil *Issuer into the interface field, which would defeat the controller's nil checks. Authorization and integrity hardening on the coordinator side: - IssueWorkloadToken is authorized per sandbox: a runner may only mint tokens for sandboxes scheduled to it, verified via the caller's mTLS client-certificate identity. - The certificate CommonName embeds the full runner ID (not an 8-char prefix) so a runner cannot choose a colliding ID and impersonate another. - The app claim is derived from the sandbox entity server-side rather than trusted from the caller, and dropped from the RPC arguments entirely. Startup queries retry transient failures so a brief blip does not leave a runner permanently without token issuance.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds a TokenIssuer interface and two RunnerRegistration RPCs: WorkloadIssuerInfo and IssueWorkloadToken. The coordinator registration server gains handlers to report issuer configuration and to mint workload identity JWTs after authorizing caller ownership and resolving the sandbox app. Generated RPC client code and two client methods are added. A new remoteIssuer implements TokenIssuer by calling IssueWorkloadToken (with a per-call timeout) and stores a fixed issuer URL. Runner startup best-effort queries the coordinator and installs a remoteIssuer when enabled; failures are logged and do not abort startup. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/runner/runner.go`:
- Around line 499-515: The WorkloadIssuerInfo loop uses the long-lived ctx so a
hung RPC can stall startup; update the loop to create a per-attempt child
context with a short timeout (use context.WithTimeout) for each call to
regClient.WorkloadIssuerInfo, pass that perAttemptCtx into WorkloadIssuerInfo,
defer/call its cancel immediately after the RPC returns, and preserve existing
retry logic and error handling tied to issuerInfoMaxAttempts and
issuerInfoRetryDelay so each attempt is bounded and the overall retry budget
still applies.
In `@servers/runner/registration_authz_test.go`:
- Around line 112-118: The test currently passes a non-existent sandbox id
("sandbox/does-not-exist") so it never hits the unscheduled branch (sch.Key.Node
== ""); instead create and insert a sandbox record that exists but has an empty
scheduling node, then call srv.authorizeSandboxOwnership with that sandbox id.
In other words, in TestAuthorizeSandboxOwnership_UnscheduledDenied replace the
non-existent id with a sandbox you add to the test server (via
newOwnershipTestServer's store/setup) whose Scheduling/Key.Node is "" so the
authorizeSandboxOwnership(srv, ...) invocation exercises the unscheduled branch.
In `@servers/runner/registration.go`:
- Around line 657-668: resolveSandboxApp may return an empty appName but the
code proceeds to call WorkloadIssuer.IssueTokenWithOptions, producing tokens
with incomplete claims; update the code (around resolveSandboxApp, appName, and
the call to WorkloadIssuer.IssueTokenWithOptions) to validate that appName is
non-empty after calling s.resolveSandboxApp(ctx, sandboxID) and if empty return
an error (or fail the request) before attempting to mint a token for sandboxID;
ensure the early-return includes a clear error/log message identifying
resolveSandboxApp failure so IssueTokenWithOptions is never called with an empty
subject.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0a3e0c4d-c4ee-434a-9c93-f472c15e4a83
📒 Files selected for processing (12)
api/runner/rpc.ymlapi/runner/runner_v1alpha/rpc.gen.gocli/commands/server.gocomponents/coordinate/coordinate.gocomponents/runner/remote_issuer.gocomponents/runner/remote_issuer_test.gocomponents/runner/runner.gocontrollers/sandbox/sandbox.gocontrollers/sandbox/token_server_test.gopkg/workloadidentity/issuer.goservers/runner/registration.goservers/runner/registration_authz_test.go
| appName := s.resolveSandboxApp(ctx, sandboxID) | ||
|
|
||
| opts := workloadidentity.TokenOptions{} | ||
| if args.HasAudience() { | ||
| opts.Audience = args.Audience() | ||
| } | ||
| if args.HasTtlSeconds() && args.TtlSeconds() > 0 { | ||
| opts.TTL = time.Duration(args.TtlSeconds()) * time.Second | ||
| } | ||
|
|
||
| token, err := s.WorkloadIssuer.IssueTokenWithOptions(appName, sandboxID, opts) | ||
| if err != nil { |
There was a problem hiding this comment.
Fail closed when sandbox app resolution fails before minting token.
Line 657 can return an empty app name, but Line 667 still issues a token. That can produce tokens with an incomplete subject/claim set and weakens the server-side app derivation guarantee.
Suggested fix
appName := s.resolveSandboxApp(ctx, sandboxID)
+ if appName == "" {
+ s.Log.Warn("workload token request denied",
+ "sandbox", sandboxID,
+ "error", "failed to resolve app identity")
+ results.SetError("failed to resolve app for sandbox")
+ return nil
+ }
opts := workloadidentity.TokenOptions{}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@servers/runner/registration.go` around lines 657 - 668, resolveSandboxApp may
return an empty appName but the code proceeds to call
WorkloadIssuer.IssueTokenWithOptions, producing tokens with incomplete claims;
update the code (around resolveSandboxApp, appName, and the call to
WorkloadIssuer.IssueTokenWithOptions) to validate that appName is non-empty
after calling s.resolveSandboxApp(ctx, sandboxID) and if empty return an error
(or fail the request) before attempting to mint a token for sandboxID; ensure
the early-return includes a clear error/log message identifying
resolveSandboxApp failure so IssueTokenWithOptions is never called with an empty
subject.
phinze
left a comment
There was a problem hiding this comment.
Really like the shape. "Runners can't sign, so they ask the coordinator to" is the right call.
On the bot notes: +1 on the startup timeout (runner.go ~501). Everything else here bounds itself with remoteTokenTimeout, but this one rides the raw runner ctx, so a hung coordinator just stalls startup. That's the one worth fixing here. +1 on the UnscheduledDenied test nit too. We'd skip the fail-closed-on-empty-app suggestion though, since the local path mints with an empty app and appless sandboxes (system/buildkit) are real, so matching local feels better.
Two inline questions, neither blocking. Filed MIR-1225 for the runner_id one. Nice work!
| } | ||
|
|
||
| enabled := info.Enabled() | ||
| if r.setEnabled(enabled) { |
There was a problem hiding this comment.
Staring at the transition handling here and wondering if it can be simplified. The coordinator only picks up an issuer at startup, so enabled and the URL are basically frozen for the process, and the loop plus enabled flag may be more than we need. Not a biggie though.
| // runner pick an ID whose prefix collides with a victim's cert name and mint | ||
| // tokens for the victim's sandboxes. A runner ID is a UUID, so "runner-" plus | ||
| // the ID stays within the certificate CommonName length limit. | ||
| func runnerCertName(runnerID string) string { |
There was a problem hiding this comment.
Full runner ID is right for the accidental-collision case. One thing worth flagging now that the ownership check leans on this CN: the runner_id behind it is whatever the caller handed Join, and Join Puts the node by ident, which overwrites rather than rejects. Since runner_ids aren't secret, a reusable-invite holder could Join as a known runner's id, get a cert with its CN, and mint for its sandboxes (and those identities federate outward). Fix belongs in Join, not here, so it's a follow-up, filed as MIR-1225.
- Bound each WorkloadIssuerInfo startup attempt with a per-attempt timeout (queryWorkloadIssuerInfo) so a hung coordinator RPC can't stall runner startup and the retry budget can expire. - Fix TestAuthorizeSandboxOwnership_UnscheduledDenied to insert a sandbox that exists but has no scheduling node, exercising the sch.Key.Node == "" branch; the missing-entity case is now its own test. - Simplify remoteIssuer: the coordinator's issuer URL is fixed for a process lifetime, so drop the background refresh loop and enabled-transition flag and cache the URL captured at startup.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/runner/remote_issuer.go (1)
57-59:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPreserve sub-second TTL requests instead of truncating to zero.
On Line 58,
opts.TTL / time.Secondtruncates toward zero, so any positive TTL below 1s becomes0, and the coordinator treats it as “no TTL provided.” This silently changes caller intent.Proposed fix
var ttlSeconds int64 if opts.TTL > 0 { - ttlSeconds = int64(opts.TTL / time.Second) + ttlSeconds = int64((opts.TTL + time.Second - 1) / time.Second) // ceil to preserve positive TTL intent }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/runner/remote_issuer.go` around lines 57 - 59, The TTL calculation truncates sub-second durations to zero; change the computation in remote_issuer.go to round up instead of truncating (preserve any positive sub-second TTL as at least 1 second). Replace the current int64(opts.TTL / time.Second) logic for ttlSeconds with an expression using math.Ceil on opts.TTL.Seconds(), e.g. ttlSeconds = int64(math.Ceil(opts.TTL.Seconds())), and add the math import so opts.TTL, ttlSeconds and the surrounding function use the new rounded-up value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@components/runner/remote_issuer.go`:
- Around line 57-59: The TTL calculation truncates sub-second durations to zero;
change the computation in remote_issuer.go to round up instead of truncating
(preserve any positive sub-second TTL as at least 1 second). Replace the current
int64(opts.TTL / time.Second) logic for ttlSeconds with an expression using
math.Ceil on opts.TTL.Seconds(), e.g. ttlSeconds =
int64(math.Ceil(opts.TTL.Seconds())), and add the math import so opts.TTL,
ttlSeconds and the surrounding function use the new rounded-up value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f44cc92d-137d-4e38-92d9-6f423932d78e
📒 Files selected for processing (4)
components/runner/remote_issuer.gocomponents/runner/remote_issuer_test.gocomponents/runner/runner.goservers/runner/registration_authz_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- components/runner/remote_issuer_test.go
- components/runner/runner.go
- servers/runner/registration_authz_test.go
The WorkloadIssuer field type change (concrete *Issuer -> TokenIssuer interface) altered sandbox.go's contents. The change is a pure type migration on the shared SandboxController; the saga controller wraps that same inner controller and inherits it, so there is no behavioral divergence to mirror. Update the recorded hash.
Problem (MIR-1224)
The workload identity issuer holds the cluster's Ed25519 signing key and was only wired into the coordinator's embedded runner. Distributed runners built their sandbox controller with a
nilissuer, so their sandboxes never received an identity token, theMIREN_*env vars, token refresh, or the on-demand token request server.Approach
Runners don't hold the signing key, so they mint tokens by calling the coordinator over RPC.
WorkloadIssuerInfoandIssueWorkloadTokento theRunnerRegistrationservice. The coordinator mints tokens on behalf of runners.workloadidentity.TokenIssuerinterface. The concrete*Issuersatisfies it on the coordinator; distributed runners use aremoteIssuerthat proxies minting over RPC. A background loop keeps the issuer URL in sync (re-registration, coordinator restart) and logs only on state transitions.*Issuerinto the interface field (would defeat the controller's!= nilchecks and panic).Hardening (coordinator side)
runner_idat Join — cannot pick a colliding ID and impersonate another runner.Testing
make lintclean. New/updated unit tests cover: theTokenIssuerinterface migration, runner-registration Join/ownership/app-derivation/cert-name, and theremoteIssuer(URL refresh + enabled-state transitions). All run on host (no containerd).Not included: an end-to-end blackbox test asserting a distributed sandbox actually receives a token + env vars — it wants the live
make dev-distributedpeers topology and is tracked as a follow-up.🤖 Generated with Claude Code