Issue workload identities on distributed runners by evanphx · Pull Request #846 · mirendev/runtime

evanphx · 2026-06-05T05:14:27Z

Problem (MIR-1224)

The workload identity issuer holds the cluster's Ed25519 signing key and was only wired into the coordinator's embedded runner. Distributed runners built their sandbox controller with a nil issuer, so their sandboxes never received an identity token, the MIREN_* env vars, token refresh, or the on-demand token request server.

Approach

Runners don't hold the signing key, so they mint tokens by calling the coordinator over RPC.

Add WorkloadIssuerInfo and IssueWorkloadToken to the RunnerRegistration service. The coordinator mints tokens on behalf of runners.
Introduce a workloadidentity.TokenIssuer interface. The concrete *Issuer satisfies it on the coordinator; distributed runners use a remoteIssuer that proxies minting over RPC. A background loop keeps the issuer URL in sync (re-registration, coordinator restart) and logs only on state transitions.
Guard against boxing a typed-nil *Issuer into the interface field (would defeat the controller's != nil checks and panic).

Hardening (coordinator side)

Per-sandbox authorization: a runner may only mint tokens for sandboxes scheduled to it, verified via the caller's mTLS client-certificate identity.
Full runner ID in the cert CommonName (not an 8-char prefix), so a runner — which chooses its own runner_id at Join — cannot pick a colliding ID and impersonate another runner.
App claim derived server-side from the sandbox entity rather than trusted from the caller, and removed from the RPC arguments entirely (it's part of the federated token subject).
Startup queries retry transient failures so a brief blip doesn't leave a runner permanently without token issuance.

Testing

make lint clean. New/updated unit tests cover: the TokenIssuer interface migration, runner-registration Join/ownership/app-derivation/cert-name, and the remoteIssuer (URL refresh + enabled-state transitions). All run on host (no containerd).

Not included: an end-to-end blackbox test asserting a distributed sandbox actually receives a token + env vars — it wants the live make dev-distributed peers topology and is tracked as a follow-up.

🤖 Generated with Claude Code

The workload identity issuer holds the cluster's Ed25519 signing key and was only wired into the coordinator's embedded runner. Distributed runners built their sandbox controller with a nil issuer, so their sandboxes never received an identity token, the MIREN_* env vars, token refresh, or the token request server. Expose token minting as a coordinator RPC that runners call: - Add WorkloadIssuerInfo and IssueWorkloadToken to the RunnerRegistration service. The coordinator (which holds the signing key) mints tokens on behalf of runners. - Introduce a workloadidentity.TokenIssuer interface. The concrete *Issuer satisfies it on the coordinator; distributed runners use a remoteIssuer that proxies minting over RPC. A background loop keeps the issuer URL in sync (re-registration, coordinator restart) and logs only on transitions. - Guard against boxing a typed-nil *Issuer into the interface field, which would defeat the controller's nil checks. Authorization and integrity hardening on the coordinator side: - IssueWorkloadToken is authorized per sandbox: a runner may only mint tokens for sandboxes scheduled to it, verified via the caller's mTLS client-certificate identity. - The certificate CommonName embeds the full runner ID (not an 8-char prefix) so a runner cannot choose a colliding ID and impersonate another. - The app claim is derived from the sandbox entity server-side rather than trusted from the caller, and dropped from the RPC arguments entirely. Startup queries retry transient failures so a brief blip does not leave a runner permanently without token issuance.

coderabbitai · 2026-06-05T05:14:40Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ace84901-d3c7-4f7e-9057-9eef3106d06e

📥 Commits

Reviewing files that changed from the base of the PR and between 0a4a853 and fc5c614.

📒 Files selected for processing (1)

controllers/sandbox/sandbox_frozen_test.go

✅ Files skipped from review due to trivial changes (1)

controllers/sandbox/sandbox_frozen_test.go

📝 Walkthrough

Walkthrough

Adds a TokenIssuer interface and two RunnerRegistration RPCs: WorkloadIssuerInfo and IssueWorkloadToken. The coordinator registration server gains handlers to report issuer configuration and to mint workload identity JWTs after authorizing caller ownership and resolving the sandbox app. Generated RPC client code and two client methods are added. A new remoteIssuer implements TokenIssuer by calling IssueWorkloadToken (with a per-call timeout) and stores a fixed issuer URL. Runner startup best-effort queries the coordinator and installs a remoteIssuer when enabled; failures are logged and do not abort startup.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/runner/runner.go`:
- Around line 499-515: The WorkloadIssuerInfo loop uses the long-lived ctx so a
hung RPC can stall startup; update the loop to create a per-attempt child
context with a short timeout (use context.WithTimeout) for each call to
regClient.WorkloadIssuerInfo, pass that perAttemptCtx into WorkloadIssuerInfo,
defer/call its cancel immediately after the RPC returns, and preserve existing
retry logic and error handling tied to issuerInfoMaxAttempts and
issuerInfoRetryDelay so each attempt is bounded and the overall retry budget
still applies.

In `@servers/runner/registration_authz_test.go`:
- Around line 112-118: The test currently passes a non-existent sandbox id
("sandbox/does-not-exist") so it never hits the unscheduled branch (sch.Key.Node
== ""); instead create and insert a sandbox record that exists but has an empty
scheduling node, then call srv.authorizeSandboxOwnership with that sandbox id.
In other words, in TestAuthorizeSandboxOwnership_UnscheduledDenied replace the
non-existent id with a sandbox you add to the test server (via
newOwnershipTestServer's store/setup) whose Scheduling/Key.Node is "" so the
authorizeSandboxOwnership(srv, ...) invocation exercises the unscheduled branch.

In `@servers/runner/registration.go`:
- Around line 657-668: resolveSandboxApp may return an empty appName but the
code proceeds to call WorkloadIssuer.IssueTokenWithOptions, producing tokens
with incomplete claims; update the code (around resolveSandboxApp, appName, and
the call to WorkloadIssuer.IssueTokenWithOptions) to validate that appName is
non-empty after calling s.resolveSandboxApp(ctx, sandboxID) and if empty return
an error (or fail the request) before attempting to mint a token for sandboxID;
ensure the early-return includes a clear error/log message identifying
resolveSandboxApp failure so IssueTokenWithOptions is never called with an empty
subject.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a3e0c4d-c4ee-434a-9c93-f472c15e4a83

📥 Commits

Reviewing files that changed from the base of the PR and between f4c8ecf and b04bc9e.

📒 Files selected for processing (12)

api/runner/rpc.yml
api/runner/runner_v1alpha/rpc.gen.go
cli/commands/server.go
components/coordinate/coordinate.go
components/runner/remote_issuer.go
components/runner/remote_issuer_test.go
components/runner/runner.go
controllers/sandbox/sandbox.go
controllers/sandbox/token_server_test.go
pkg/workloadidentity/issuer.go
servers/runner/registration.go
servers/runner/registration_authz_test.go

coderabbitai · 2026-06-05T05:20:58Z

+	appName := s.resolveSandboxApp(ctx, sandboxID)
+
+	opts := workloadidentity.TokenOptions{}
+	if args.HasAudience() {
+		opts.Audience = args.Audience()
+	}
+	if args.HasTtlSeconds() && args.TtlSeconds() > 0 {
+		opts.TTL = time.Duration(args.TtlSeconds()) * time.Second
+	}
+
+	token, err := s.WorkloadIssuer.IssueTokenWithOptions(appName, sandboxID, opts)
+	if err != nil {


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when sandbox app resolution fails before minting token.

Line 657 can return an empty app name, but Line 667 still issues a token. That can produce tokens with an incomplete subject/claim set and weakens the server-side app derivation guarantee.

Suggested fix

appName := s.resolveSandboxApp(ctx, sandboxID) + if appName == "" { + s.Log.Warn("workload token request denied", + "sandbox", sandboxID, + "error", "failed to resolve app identity") + results.SetError("failed to resolve app for sandbox") + return nil + } opts := workloadidentity.TokenOptions{}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@servers/runner/registration.go` around lines 657 - 668, resolveSandboxApp may return an empty appName but the code proceeds to call WorkloadIssuer.IssueTokenWithOptions, producing tokens with incomplete claims; update the code (around resolveSandboxApp, appName, and the call to WorkloadIssuer.IssueTokenWithOptions) to validate that appName is non-empty after calling s.resolveSandboxApp(ctx, sandboxID) and if empty return an error (or fail the request) before attempting to mint a token for sandboxID; ensure the early-return includes a clear error/log message identifying resolveSandboxApp failure so IssueTokenWithOptions is never called with an empty subject.

phinze

Really like the shape. "Runners can't sign, so they ask the coordinator to" is the right call.

On the bot notes: +1 on the startup timeout (runner.go ~501). Everything else here bounds itself with remoteTokenTimeout, but this one rides the raw runner ctx, so a hung coordinator just stalls startup. That's the one worth fixing here. +1 on the UnscheduledDenied test nit too. We'd skip the fail-closed-on-empty-app suggestion though, since the local path mints with an empty app and appless sandboxes (system/buildkit) are real, so matching local feels better.

Two inline questions, neither blocking. Filed MIR-1225 for the runner_id one. Nice work!

phinze · 2026-06-05T14:12:01Z

+	}
+
+	enabled := info.Enabled()
+	if r.setEnabled(enabled) {


Staring at the transition handling here and wondering if it can be simplified. The coordinator only picks up an issuer at startup, so enabled and the URL are basically frozen for the process, and the loop plus enabled flag may be more than we need. Not a biggie though.

phinze · 2026-06-05T14:12:01Z

+// runner pick an ID whose prefix collides with a victim's cert name and mint
+// tokens for the victim's sandboxes. A runner ID is a UUID, so "runner-" plus
+// the ID stays within the certificate CommonName length limit.
+func runnerCertName(runnerID string) string {


Full runner ID is right for the accidental-collision case. One thing worth flagging now that the ownership check leans on this CN: the runner_id behind it is whatever the caller handed Join, and Join Puts the node by ident, which overwrites rather than rejects. Since runner_ids aren't secret, a reusable-invite holder could Join as a known runner's id, get a cert with its CN, and mint for its sandboxes (and those identities federate outward). Fix belongs in Join, not here, so it's a follow-up, filed as MIR-1225.

- Bound each WorkloadIssuerInfo startup attempt with a per-attempt timeout (queryWorkloadIssuerInfo) so a hung coordinator RPC can't stall runner startup and the retry budget can expire. - Fix TestAuthorizeSandboxOwnership_UnscheduledDenied to insert a sandbox that exists but has no scheduling node, exercising the sch.Key.Node == "" branch; the missing-entity case is now its own test. - Simplify remoteIssuer: the coordinator's issuer URL is fixed for a process lifetime, so drop the background refresh loop and enabled-transition flag and cache the URL captured at startup.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

components/runner/remote_issuer.go (1)

57-59: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve sub-second TTL requests instead of truncating to zero.

On Line 58, opts.TTL / time.Second truncates toward zero, so any positive TTL below 1s becomes 0, and the coordinator treats it as “no TTL provided.” This silently changes caller intent.

Proposed fix

 	var ttlSeconds int64
 	if opts.TTL > 0 {
-		ttlSeconds = int64(opts.TTL / time.Second)
+		ttlSeconds = int64((opts.TTL + time.Second - 1) / time.Second) // ceil to preserve positive TTL intent
 	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/runner/remote_issuer.go` around lines 57 - 59, The TTL calculation
truncates sub-second durations to zero; change the computation in
remote_issuer.go to round up instead of truncating (preserve any positive
sub-second TTL as at least 1 second). Replace the current int64(opts.TTL /
time.Second) logic for ttlSeconds with an expression using math.Ceil on
opts.TTL.Seconds(), e.g. ttlSeconds = int64(math.Ceil(opts.TTL.Seconds())), and
add the math import so opts.TTL, ttlSeconds and the surrounding function use the
new rounded-up value.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@components/runner/remote_issuer.go`:
- Around line 57-59: The TTL calculation truncates sub-second durations to zero;
change the computation in remote_issuer.go to round up instead of truncating
(preserve any positive sub-second TTL as at least 1 second). Replace the current
int64(opts.TTL / time.Second) logic for ttlSeconds with an expression using
math.Ceil on opts.TTL.Seconds(), e.g. ttlSeconds =
int64(math.Ceil(opts.TTL.Seconds())), and add the math import so opts.TTL,
ttlSeconds and the surrounding function use the new rounded-up value.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f44cc92d-137d-4e38-92d9-6f423932d78e

📥 Commits

Reviewing files that changed from the base of the PR and between b04bc9e and 0a4a853.

📒 Files selected for processing (4)

components/runner/remote_issuer.go
components/runner/remote_issuer_test.go
components/runner/runner.go
servers/runner/registration_authz_test.go

🚧 Files skipped from review as they are similar to previous changes (3)

components/runner/remote_issuer_test.go
components/runner/runner.go
servers/runner/registration_authz_test.go

The WorkloadIssuer field type change (concrete *Issuer -> TokenIssuer interface) altered sandbox.go's contents. The change is a pure type migration on the shared SandboxController; the saga controller wraps that same inner controller and inherits it, so there is no behavioral divergence to mirror. Update the recorded hash.

evanphx requested a review from a team as a code owner June 5, 2026 05:14

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

phinze approved these changes Jun 5, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

evanphx merged commit 41aacf3 into main Jun 6, 2026
19 checks passed

evanphx deleted the mir-1224-on-distributed-runners-identities-are-not-issued branch June 6, 2026 20:28

miren-code-agent Bot mentioned this pull request Jun 9, 2026

Release v0.10.0 #853

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue workload identities on distributed runners#846

Issue workload identities on distributed runners#846
evanphx merged 3 commits into
mainfrom
mir-1224-on-distributed-runners-identities-are-not-issued

evanphx commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 5, 2026

Uh oh!

phinze left a comment

Uh oh!

phinze Jun 5, 2026

Uh oh!

evanphx Jun 6, 2026

Uh oh!

phinze Jun 5, 2026

Uh oh!

evanphx Jun 6, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

evanphx commented Jun 5, 2026

Problem (MIR-1224)

Approach

Hardening (coordinator side)

Testing

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

phinze left a comment

Choose a reason for hiding this comment

Uh oh!

phinze Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

evanphx Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

phinze Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

evanphx Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading