Skip to content

Issue workload identities on distributed runners#846

Merged
evanphx merged 3 commits into
mainfrom
mir-1224-on-distributed-runners-identities-are-not-issued
Jun 6, 2026
Merged

Issue workload identities on distributed runners#846
evanphx merged 3 commits into
mainfrom
mir-1224-on-distributed-runners-identities-are-not-issued

Conversation

@evanphx

@evanphx evanphx commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Problem (MIR-1224)

The workload identity issuer holds the cluster's Ed25519 signing key and was only wired into the coordinator's embedded runner. Distributed runners built their sandbox controller with a nil issuer, so their sandboxes never received an identity token, the MIREN_* env vars, token refresh, or the on-demand token request server.

Approach

Runners don't hold the signing key, so they mint tokens by calling the coordinator over RPC.

  • Add WorkloadIssuerInfo and IssueWorkloadToken to the RunnerRegistration service. The coordinator mints tokens on behalf of runners.
  • Introduce a workloadidentity.TokenIssuer interface. The concrete *Issuer satisfies it on the coordinator; distributed runners use a remoteIssuer that proxies minting over RPC. A background loop keeps the issuer URL in sync (re-registration, coordinator restart) and logs only on state transitions.
  • Guard against boxing a typed-nil *Issuer into the interface field (would defeat the controller's != nil checks and panic).

Hardening (coordinator side)

  • Per-sandbox authorization: a runner may only mint tokens for sandboxes scheduled to it, verified via the caller's mTLS client-certificate identity.
  • Full runner ID in the cert CommonName (not an 8-char prefix), so a runner — which chooses its own runner_id at Join — cannot pick a colliding ID and impersonate another runner.
  • App claim derived server-side from the sandbox entity rather than trusted from the caller, and removed from the RPC arguments entirely (it's part of the federated token subject).
  • Startup queries retry transient failures so a brief blip doesn't leave a runner permanently without token issuance.

Testing

make lint clean. New/updated unit tests cover: the TokenIssuer interface migration, runner-registration Join/ownership/app-derivation/cert-name, and the remoteIssuer (URL refresh + enabled-state transitions). All run on host (no containerd).

Not included: an end-to-end blackbox test asserting a distributed sandbox actually receives a token + env vars — it wants the live make dev-distributed peers topology and is tracked as a follow-up.

🤖 Generated with Claude Code

The workload identity issuer holds the cluster's Ed25519 signing key and
was only wired into the coordinator's embedded runner. Distributed runners
built their sandbox controller with a nil issuer, so their sandboxes never
received an identity token, the MIREN_* env vars, token refresh, or the
token request server.

Expose token minting as a coordinator RPC that runners call:

- Add WorkloadIssuerInfo and IssueWorkloadToken to the RunnerRegistration
  service. The coordinator (which holds the signing key) mints tokens on
  behalf of runners.
- Introduce a workloadidentity.TokenIssuer interface. The concrete *Issuer
  satisfies it on the coordinator; distributed runners use a remoteIssuer
  that proxies minting over RPC. A background loop keeps the issuer URL in
  sync (re-registration, coordinator restart) and logs only on transitions.
- Guard against boxing a typed-nil *Issuer into the interface field, which
  would defeat the controller's nil checks.

Authorization and integrity hardening on the coordinator side:

- IssueWorkloadToken is authorized per sandbox: a runner may only mint
  tokens for sandboxes scheduled to it, verified via the caller's mTLS
  client-certificate identity.
- The certificate CommonName embeds the full runner ID (not an 8-char
  prefix) so a runner cannot choose a colliding ID and impersonate another.
- The app claim is derived from the sandbox entity server-side rather than
  trusted from the caller, and dropped from the RPC arguments entirely.

Startup queries retry transient failures so a brief blip does not leave a
runner permanently without token issuance.
@evanphx evanphx requested a review from a team as a code owner June 5, 2026 05:14
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ace84901-d3c7-4f7e-9057-9eef3106d06e

📥 Commits

Reviewing files that changed from the base of the PR and between 0a4a853 and fc5c614.

📒 Files selected for processing (1)
  • controllers/sandbox/sandbox_frozen_test.go
✅ Files skipped from review due to trivial changes (1)
  • controllers/sandbox/sandbox_frozen_test.go

📝 Walkthrough

Walkthrough

Adds a TokenIssuer interface and two RunnerRegistration RPCs: WorkloadIssuerInfo and IssueWorkloadToken. The coordinator registration server gains handlers to report issuer configuration and to mint workload identity JWTs after authorizing caller ownership and resolving the sandbox app. Generated RPC client code and two client methods are added. A new remoteIssuer implements TokenIssuer by calling IssueWorkloadToken (with a per-call timeout) and stores a fixed issuer URL. Runner startup best-effort queries the coordinator and installs a remoteIssuer when enabled; failures are logged and do not abort startup.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/runner/runner.go`:
- Around line 499-515: The WorkloadIssuerInfo loop uses the long-lived ctx so a
hung RPC can stall startup; update the loop to create a per-attempt child
context with a short timeout (use context.WithTimeout) for each call to
regClient.WorkloadIssuerInfo, pass that perAttemptCtx into WorkloadIssuerInfo,
defer/call its cancel immediately after the RPC returns, and preserve existing
retry logic and error handling tied to issuerInfoMaxAttempts and
issuerInfoRetryDelay so each attempt is bounded and the overall retry budget
still applies.

In `@servers/runner/registration_authz_test.go`:
- Around line 112-118: The test currently passes a non-existent sandbox id
("sandbox/does-not-exist") so it never hits the unscheduled branch (sch.Key.Node
== ""); instead create and insert a sandbox record that exists but has an empty
scheduling node, then call srv.authorizeSandboxOwnership with that sandbox id.
In other words, in TestAuthorizeSandboxOwnership_UnscheduledDenied replace the
non-existent id with a sandbox you add to the test server (via
newOwnershipTestServer's store/setup) whose Scheduling/Key.Node is "" so the
authorizeSandboxOwnership(srv, ...) invocation exercises the unscheduled branch.

In `@servers/runner/registration.go`:
- Around line 657-668: resolveSandboxApp may return an empty appName but the
code proceeds to call WorkloadIssuer.IssueTokenWithOptions, producing tokens
with incomplete claims; update the code (around resolveSandboxApp, appName, and
the call to WorkloadIssuer.IssueTokenWithOptions) to validate that appName is
non-empty after calling s.resolveSandboxApp(ctx, sandboxID) and if empty return
an error (or fail the request) before attempting to mint a token for sandboxID;
ensure the early-return includes a clear error/log message identifying
resolveSandboxApp failure so IssueTokenWithOptions is never called with an empty
subject.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a3e0c4d-c4ee-434a-9c93-f472c15e4a83

📥 Commits

Reviewing files that changed from the base of the PR and between f4c8ecf and b04bc9e.

📒 Files selected for processing (12)
  • api/runner/rpc.yml
  • api/runner/runner_v1alpha/rpc.gen.go
  • cli/commands/server.go
  • components/coordinate/coordinate.go
  • components/runner/remote_issuer.go
  • components/runner/remote_issuer_test.go
  • components/runner/runner.go
  • controllers/sandbox/sandbox.go
  • controllers/sandbox/token_server_test.go
  • pkg/workloadidentity/issuer.go
  • servers/runner/registration.go
  • servers/runner/registration_authz_test.go

Comment thread components/runner/runner.go
Comment thread servers/runner/registration_authz_test.go Outdated
Comment on lines +657 to +668
appName := s.resolveSandboxApp(ctx, sandboxID)

opts := workloadidentity.TokenOptions{}
if args.HasAudience() {
opts.Audience = args.Audience()
}
if args.HasTtlSeconds() && args.TtlSeconds() > 0 {
opts.TTL = time.Duration(args.TtlSeconds()) * time.Second
}

token, err := s.WorkloadIssuer.IssueTokenWithOptions(appName, sandboxID, opts)
if err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when sandbox app resolution fails before minting token.

Line 657 can return an empty app name, but Line 667 still issues a token. That can produce tokens with an incomplete subject/claim set and weakens the server-side app derivation guarantee.

Suggested fix
 	appName := s.resolveSandboxApp(ctx, sandboxID)
+	if appName == "" {
+		s.Log.Warn("workload token request denied",
+			"sandbox", sandboxID,
+			"error", "failed to resolve app identity")
+		results.SetError("failed to resolve app for sandbox")
+		return nil
+	}
 
 	opts := workloadidentity.TokenOptions{}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@servers/runner/registration.go` around lines 657 - 668, resolveSandboxApp may
return an empty appName but the code proceeds to call
WorkloadIssuer.IssueTokenWithOptions, producing tokens with incomplete claims;
update the code (around resolveSandboxApp, appName, and the call to
WorkloadIssuer.IssueTokenWithOptions) to validate that appName is non-empty
after calling s.resolveSandboxApp(ctx, sandboxID) and if empty return an error
(or fail the request) before attempting to mint a token for sandboxID; ensure
the early-return includes a clear error/log message identifying
resolveSandboxApp failure so IssueTokenWithOptions is never called with an empty
subject.

@phinze phinze left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like the shape. "Runners can't sign, so they ask the coordinator to" is the right call.

On the bot notes: +1 on the startup timeout (runner.go ~501). Everything else here bounds itself with remoteTokenTimeout, but this one rides the raw runner ctx, so a hung coordinator just stalls startup. That's the one worth fixing here. +1 on the UnscheduledDenied test nit too. We'd skip the fail-closed-on-empty-app suggestion though, since the local path mints with an empty app and appless sandboxes (system/buildkit) are real, so matching local feels better.

Two inline questions, neither blocking. Filed MIR-1225 for the runner_id one. Nice work!

Comment thread components/runner/remote_issuer.go Outdated
}

enabled := info.Enabled()
if r.setEnabled(enabled) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Staring at the transition handling here and wondering if it can be simplified. The coordinator only picks up an issuer at startup, so enabled and the URL are basically frozen for the process, and the loop plus enabled flag may be more than we need. Not a biggie though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup! done.

// runner pick an ID whose prefix collides with a victim's cert name and mint
// tokens for the victim's sandboxes. A runner ID is a UUID, so "runner-" plus
// the ID stays within the certificate CommonName length limit.
func runnerCertName(runnerID string) string {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full runner ID is right for the accidental-collision case. One thing worth flagging now that the ownership check leans on this CN: the runner_id behind it is whatever the caller handed Join, and Join Puts the node by ident, which overwrites rather than rejects. Since runner_ids aren't secret, a reusable-invite holder could Join as a known runner's id, get a cert with its CN, and mint for its sandboxes (and those identities federate outward). Fix belongs in Join, not here, so it's a follow-up, filed as MIR-1225.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

- Bound each WorkloadIssuerInfo startup attempt with a per-attempt timeout
  (queryWorkloadIssuerInfo) so a hung coordinator RPC can't stall runner
  startup and the retry budget can expire.
- Fix TestAuthorizeSandboxOwnership_UnscheduledDenied to insert a sandbox
  that exists but has no scheduling node, exercising the sch.Key.Node == ""
  branch; the missing-entity case is now its own test.
- Simplify remoteIssuer: the coordinator's issuer URL is fixed for a process
  lifetime, so drop the background refresh loop and enabled-transition flag
  and cache the URL captured at startup.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/runner/remote_issuer.go (1)

57-59: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve sub-second TTL requests instead of truncating to zero.

On Line 58, opts.TTL / time.Second truncates toward zero, so any positive TTL below 1s becomes 0, and the coordinator treats it as “no TTL provided.” This silently changes caller intent.

Proposed fix
 	var ttlSeconds int64
 	if opts.TTL > 0 {
-		ttlSeconds = int64(opts.TTL / time.Second)
+		ttlSeconds = int64((opts.TTL + time.Second - 1) / time.Second) // ceil to preserve positive TTL intent
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/runner/remote_issuer.go` around lines 57 - 59, The TTL calculation
truncates sub-second durations to zero; change the computation in
remote_issuer.go to round up instead of truncating (preserve any positive
sub-second TTL as at least 1 second). Replace the current int64(opts.TTL /
time.Second) logic for ttlSeconds with an expression using math.Ceil on
opts.TTL.Seconds(), e.g. ttlSeconds = int64(math.Ceil(opts.TTL.Seconds())), and
add the math import so opts.TTL, ttlSeconds and the surrounding function use the
new rounded-up value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@components/runner/remote_issuer.go`:
- Around line 57-59: The TTL calculation truncates sub-second durations to zero;
change the computation in remote_issuer.go to round up instead of truncating
(preserve any positive sub-second TTL as at least 1 second). Replace the current
int64(opts.TTL / time.Second) logic for ttlSeconds with an expression using
math.Ceil on opts.TTL.Seconds(), e.g. ttlSeconds =
int64(math.Ceil(opts.TTL.Seconds())), and add the math import so opts.TTL,
ttlSeconds and the surrounding function use the new rounded-up value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f44cc92d-137d-4e38-92d9-6f423932d78e

📥 Commits

Reviewing files that changed from the base of the PR and between b04bc9e and 0a4a853.

📒 Files selected for processing (4)
  • components/runner/remote_issuer.go
  • components/runner/remote_issuer_test.go
  • components/runner/runner.go
  • servers/runner/registration_authz_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • components/runner/remote_issuer_test.go
  • components/runner/runner.go
  • servers/runner/registration_authz_test.go

The WorkloadIssuer field type change (concrete *Issuer -> TokenIssuer
interface) altered sandbox.go's contents. The change is a pure type
migration on the shared SandboxController; the saga controller wraps that
same inner controller and inherits it, so there is no behavioral
divergence to mirror. Update the recorded hash.
@evanphx evanphx merged commit 41aacf3 into main Jun 6, 2026
19 checks passed
@evanphx evanphx deleted the mir-1224-on-distributed-runners-identities-are-not-issued branch June 6, 2026 20:28
@miren-code-agent miren-code-agent Bot mentioned this pull request Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants