
WIP: add local sudo capability on devices #317

Draft
michaelw wants to merge 25 commits into main from mw/local-sudo-request


Collaborator

@michaelw michaelw commented Apr 20, 2026

Summary

  • add a self-contained frontend/browser E2E baseline for agent login
  • add canonical device_id plumbing plus shared device routing and shared device-definition registries for device-targeted workflows
  • add execution planning for device-targeted provider work, and build local sudo on top of it
  • add a brokered macOS timed-sudo implementation with native privilege services, packaging, and integration coverage

What Changed

Canonical device identity and shared device routing

  • makes device_id the only device identity used for registration, routing, execution planning, and local sudo
  • introduces shared Temporal-backed device registries:
    • agents publish live device routes directly
    • servers publish shared device definitions/policy
  • adds thand config device-id
  • keeps the shared device-registry queue unversioned so the singleton registries can self-heal and remain startup-order tolerant (there are probably better fixes, but versioned queues left workflows stuck during development)

Execution planning for device-targeted workflows

  • uses shared device definitions during planning so any server can plan for any configured device
  • keeps authorize/revoke bound to the recorded plan rather than rebuilding routing at execution time
  • adds recorded execution plans for provider tasks that need device-local execution
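The idea of binding authorize/revoke to a recorded plan can be sketched as follows. This is a minimal illustration, not the branch's code: the `PlanEntry` fields, the `Definitions` map, and the `Plan`/`Revoke` helpers are all hypothetical names. It only shows that routing is resolved once at planning time and that later steps read the recorded entry rather than the registry.

```go
package main

import "fmt"

// PlanEntry is a hypothetical recorded plan entry. The real entry in the
// branch carries more data; only the "record once, reuse later" idea matters.
type PlanEntry struct {
	DeviceID string
	Route    string
}

// Definitions stands in for the shared device definitions: device_id -> route.
// The shape is an assumption made for this sketch.
type Definitions map[string]string

// Plan resolves the route once, at planning time, and records the result.
func Plan(defs Definitions, deviceID string) (PlanEntry, error) {
	route, ok := defs[deviceID]
	if !ok {
		return PlanEntry{}, fmt.Errorf("no device definition for %q", deviceID)
	}
	return PlanEntry{DeviceID: deviceID, Route: route}, nil
}

// Revoke reads only the recorded entry, so registry changes made after
// planning cannot redirect it.
func Revoke(entry PlanEntry) string {
	return fmt.Sprintf("revoke on %s via %s", entry.DeviceID, entry.Route)
}

func main() {
	defs := Definitions{"dev-1": "queue-a"}
	entry, err := Plan(defs, "dev-1")
	if err != nil {
		panic(err)
	}
	defs["dev-1"] = "queue-b" // the registry changes after planning
	fmt.Println(Revoke(entry)) // prints "revoke on dev-1 via queue-a"
}
```

Because `Revoke` never consults `defs`, the revoke step targets the same place the authorize step did, even if the registry has moved on.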

Local sudo

  • adds local sudo request handling in CLI/API/server/provider flows
  • defaults thand request sudo to the current machine when --device is omitted
  • adds workflow and provider test coverage for:
    • device-local authorize routing
    • revoke recovery after route loss and return
    • negative cases for missing route vs missing device definition

macOS privilege services

  • adds a native PrivilegeServices workspace with:
    • broker daemon
    • notifier/login item
    • brokerctl client
    • shared lease store / peer-auth / XPC transport
  • adds a Go gRPC bridge and local broker client used by the Darwin local provider
  • routes Darwin timed sudo through the brokered path
  • adds packaging, install/uninstall tooling, CI wiring, signing support, and development docs
    (not yet built in CI, pending Apple signing secrets; it can be built locally)

Small follow-ups included in the branch

  • adds a frontend integration harness that can run a server container and a second agent container together
  • covers agent login end to end in a real browser flow
  • keeps browser-facing URLs on localhost while using thand.test only for container-to-host reachability
  • guards config push-back sync behind a real configured Thand service
  • adds focused routing/revoke recovery tests
  • stabilizes the Temporal execution-plan caching test harness

Testing

  • go test ./internal/common ./internal/config ./internal/daemon ./cmd/cli ./internal/workflows/tasks/providers/thand ./internal/models ./internal/providers/local ./internal/localbroker
  • go test -tags thand_dev ./internal/common
  • ./scripts/test-macos-privilege-services.sh
  • live macOS e2e validation of:
    • normal timed-sudo run to completion
    • agent stopped before revoke, local broker expiry, then agent restart and workflow convergence
    • server-only negative path (server does not consume device-local work)
    • route-only vs definition-only negative cases
    • server/agent startup in either order with eventual route publication

Notes

  • the shared device registry queue is intentionally unversioned because these singleton control-plane workflows are signaled directly by clients and must remain startup-order tolerant (TODO: better options?)

Unmarshal the full merged config in MergeConfiguration and apply that end state directly instead of routing a sparse diff back through the sync apply path.

Keep applyPatch as the helper for real partial section diffs, and factor shared normalization/store helpers so both flows continue to validate and normalize definitions before persisting them.

Snapshot the current config generation before building the merged sync view, normalize the merged role/workflow/provider definition maps off-lock, and only commit them if the generation is unchanged.

Keep the retry logic scoped to MergeConfiguration, compare and commit definitions only, and detach the snapshot through JSON so stale retries do not alias nested state. Reloaded definitions now bump the generation counter, while broader nested-mutation cleanup remains tracked in #306.
* mw/fix-config-sync-apply:
  Retry config sync on concurrent changes
  Fix synced config application
@github-actions github-actions Bot added the test Adding or updating tests label Apr 20, 2026
@michaelw michaelw force-pushed the mw/local-sudo-request branch 4 times, most recently from 602eaea to 5b76f44 Compare April 23, 2026 21:34
@michaelw michaelw force-pushed the mw/local-sudo-request branch from 5b76f44 to a528efe Compare April 24, 2026 02:47
@michaelw michaelw force-pushed the mw/local-sudo-request branch from a528efe to 1aac00a Compare April 24, 2026 23:41
Comment thread internal/workflows/tasks/providers/thand/approvals.go Fixed
michaelw and others added 2 commits April 24, 2026 19:58
# Conflicts:
#	Makefile
#	internal/config/providers.go
#	internal/config/services/temporal/main.go
#	internal/models/provider_workflows.go
#	internal/workflows/tasks/providers/thand/approvals.go
#	internal/workflows/tasks/providers/thand/authorize.go
#	internal/workflows/tasks/providers/thand/revoke.go
if cancelPresence != nil {
    cancelPresence()
}
pendingPresence--
EnsureDeviceRegistryWorkflows and PublishConfiguredDeviceDefinitions call
TemporalClient.GetClient(), which blocks on the readyCh until StartWorkers
closes it. Running them inside SetupTemporal (registration phase) deadlocks
because StartTemporalWorkers is invoked later. Move them to run after
StartWorkers so the client is ready when they execute.
The merge from main reintroduced the pattern where the authorize/revoke
child workflows expect a WorkflowRoleRequest and resolve it to an
AuthorizeRoleRequest inside the workflow via a local activity. The branch
design instead pre-builds the AuthorizeRoleRequest at execution-planning
time and stores it on ExecutionPlanEntry, and the thand task caller
already invokes the child workflow with the pre-built type.

That mismatch caused integration failures with:
  unable to decode the workflow function input payload: cannot unmarshal
  object into Go struct field WorkflowRoleRequest.identity of type string

Restore the branch's signatures so the workflows accept the materialized
request types directly. The BuildAuthorizeRoleRequest activity remains
registered for callers that still build via WorkflowRoleRequest, but the
provider child workflows themselves no longer depend on it.
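The decode failure quoted above is the kind of error `encoding/json` reports when a payload built from one request type is unmarshaled into another whose field of the same name has a different shape. The struct definitions below are hypothetical; only the type names and the `identity` field name are taken from the error text, and the real types certainly differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Hypothetical shapes: the caller sends an object under "identity",
// while the receiving signature expects a plain string.
type WorkflowRoleRequest struct {
	Identity string `json:"identity"`
}

type Identity struct {
	ID string `json:"id"`
}

type AuthorizeRoleRequest struct {
	Identity Identity `json:"identity"`
}

// decodeAs unmarshals the payload into the requested type.
func decodeAs[T any](payload []byte) (T, error) {
	var out T
	err := json.Unmarshal(payload, &out)
	return out, err
}

// isTypeMismatch reports whether decoding failed on a JSON/Go type conflict.
func isTypeMismatch(err error) bool {
	var typeErr *json.UnmarshalTypeError
	return errors.As(err, &typeErr)
}

// describe formats the outcome of a decode attempt.
func describe(target string, err error) string {
	return fmt.Sprintf("%s: mismatch=%t err=%v", target, isTypeMismatch(err), err)
}

func main() {
	// The caller pre-builds the materialized request.
	payload, err := json.Marshal(AuthorizeRoleRequest{Identity: Identity{ID: "alice"}})
	if err != nil {
		panic(err)
	}

	// A workflow declared with the wrong input type fails to decode it.
	_, err = decodeAs[WorkflowRoleRequest](payload)
	fmt.Println(describe("WorkflowRoleRequest", err))

	// A workflow declared with the matching type decodes it cleanly.
	_, err = decodeAs[AuthorizeRoleRequest](payload)
	fmt.Println(describe("AuthorizeRoleRequest", err))
}
```

Keeping the caller's payload type and the workflow's input type identical, as the restored signatures do, removes this class of failure.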
