Skip to content

RFC: fix config sync application and retry semantics#309

Draft
michaelw wants to merge 2 commits into
thand-io:mainfrom
michaelw:mw/fix-config-sync-apply
Draft

RFC: fix config sync application and retry semantics#309
michaelw wants to merge 2 commits into
thand-io:mainfrom
michaelw:mw/fix-config-sync-apply

Conversation

@michaelw
Copy link
Copy Markdown
Collaborator

@michaelw michaelw commented Apr 15, 2026

This fixes a pre-existing config sync bug where server-synced role, workflow, and provider definitions could be validated without becoming the live in-memory config, and then hardens that merged sync path against stale-snapshot lost updates.

How to Read

  • Look at the test results (comments below)
  • start with the first commit only; two concerns: (1) Apply* results get thrown away?, and (2) sparse diffs would cause zeroed values that the validation would throw away anyway.

If that makes sense:

  • second commit highlights a potential concurrency issue (lost updates when switching from RLock to Lock sections).

What changed

  • apply the full merged server state directly in MergeConfiguration, instead of routing that merged state back through the partial-patch helper
  • keep applyPatch as the helper for real partial section diffs, with shared normalize/store helpers underneath
  • add generation-based retry around the merged sync path only, with definitions-only compare/commit and detached snapshots for retry isolation

Residual gap

Definition maps are now treated as immutable snapshots for sync and reload purposes, but broader nested-mutation and lock-discipline cleanup is still tracked separately in #306. The skipped evidence tests in this branch document that remaining gap without widening this PR.

Validation

  • go test ./internal/config
  • go test ./...

@michaelw
Copy link
Copy Markdown
Collaborator Author

Upstream red/green evidence for the sync application fix:

On untouched upstream/main, I overlaid the first-commit regression test file from this branch and ran:

go test ./internal/config -run 'TestMergeConfiguration_|TestApplyPatch_'

That repro fails on the pre-fix code with the expected sync-application breakages:

--- FAIL: TestMergeConfiguration_ServerSendsNewRoles
    expected synced role to be stored locally

--- FAIL: TestMergeConfiguration_ServerSendsUpdatedRole
    expected: "Can edit and publish"
    actual  : "Can edit"

--- FAIL: TestMergeConfiguration_ServerSendsNewWorkflows
    expected synced workflow to be stored locally

--- FAIL: TestMergeConfiguration_ServerSendsUpdatedWorkflow
    expected: "Updated workflow"
    actual  : "Existing workflow"

--- FAIL: TestMergeConfiguration_ServerSendsNewProviders
    expected synced provider to be stored locally

--- FAIL: TestMergeConfiguration_ServerSendsUpdatedProvider
    expected: "Updated provider description"
    actual  : "Old provider description"

--- FAIL: TestMergeConfiguration_PartialConfig_OnlyRoles
    expected synced role to be stored locally

--- FAIL: TestApplyPatch_AppliesRoles
    map[string]models.Role(nil) does not contain "new-role"

Representative constructed fixtures from those tests:

// new role
local := Roles.Definitions = nil
server := RegistrationResponse{
  Roles: {Definitions: {"admin": {Name: "admin", Enabled: true}}},
}
// expected after MergeConfiguration:
// Roles.Definitions == {"admin": {Name: "admin", Enabled: true}}
// updated role
local := Roles.Definitions = {
  "editor":    {Name: "editor", Description: "Can edit", Enabled: true},
  "untouched": {Name: "untouched", Description: "Keep me", Enabled: true},
}
server := RegistrationResponse{
  Roles: {Definitions: {"editor": {Name: "editor", Description: "Can edit and publish", Enabled: true}}},
}
// expected after MergeConfiguration:
// editor.Description == "Can edit and publish"
// untouched still present
// new workflow
local := Workflows.Definitions = nil
server := RegistrationResponse{
  Workflows: {Definitions: {"approval": makeTestWorkflow("approval", "Handles approvals")}},
}
// expected after MergeConfiguration:
// Workflows.Definitions["approval"].Description == "Handles approvals"
// updated provider
local := Providers.Definitions = {
  "mock-primary": {Name: "mock-primary", Description: "Old provider description", Provider: "mock", Enabled: true},
  "mock-extra":   {Name: "mock-extra", Description: "Keep me", Provider: "mock", Enabled: true},
}
server := RegistrationResponse{
  Providers: {Definitions: {"mock-primary": {Name: "mock-primary", Description: "Updated provider description", Provider: "mock", Enabled: true}}},
}
// expected after MergeConfiguration:
// mock-primary.Description == "Updated provider description"
// mock-extra still present

The same focused cases pass on this branch:

go test ./internal/config -run 'TestMergeConfiguration_ServerSends(NewRoles|UpdatedRole|NewWorkflows|UpdatedWorkflow|NewProviders|UpdatedProvider|PartialConfig_OnlyRoles)|TestApplyPatch_AppliesRoles'

So the red/green story here is: upstream computes and validates the incoming synced definitions, but the live c.Roles.Definitions, c.Workflows.Definitions, and c.Providers.Definitions maps are not updated. This branch fixes that caller behavior and then hardens the sync path separately in the second commit.

Comment thread internal/config/sync.go
}

func (c *Config) updateRoles(roleConfig *RoleConfig) error {
_, err := c.ApplyRoles([]*models.RoleDefinitions{{
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the defs returned by ApplyRoles get thrown away here?

Comment thread internal/config/sync.go
}

// Create a patch to see the differences between existing and new
incomingPatch, err := jsonpatch.CreateMergePatch(existingData, newData)
Copy link
Copy Markdown
Collaborator Author

@michaelw michaelw Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this patch is sparse AIUI, contains only fields that changed:

{
  "roles": {
    "definitions": {
      "editor": {
        "description": "Can edit and publish"
      }
    }
  }
}

Comment thread internal/config/sync.go
// Convert patches back to structs - these are the NEW changes from the remote
// server that we need to apply to our existing configuration
var incomingDiff ConfigPatchRequest
err = json.Unmarshal(incomingPatch, &incomingDiff)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Unmarshal will do the wrong thing here for the example above:

ConfigPatchRequest{
    RoleConfig: &RoleConfig{
        Definitions: map[string]models.Role{
            "editor": {
                Name:        "",
                Description: "Can edit and publish",
                Enabled:     false,
            },
        },
    },
}

We can't discriminate whether Name: "" was an intended change, or just the zero default value backfilled by Unmarshal.

Comment thread internal/config/sync.go Outdated
// sparse diff payload. Unmarshaling the sparse merge patch here would
// collapse omitted fields to zero values in typed structs.
var mergedConfig ConfigPatchRequest
err = json.Unmarshal(newData, &mergedConfig)
Copy link
Copy Markdown
Collaborator Author

@michaelw michaelw Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the ugly bit: mergedConfig is always a full patch, so that no fields are zeroed. There is no point to apply the patch as before, hence use applyMergedConfig below.

Comment thread internal/config/sync.go
return err
}

return c.storeRoleDefinitions(merged.Definitions)
Copy link
Copy Markdown
Collaborator Author

@michaelw michaelw Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the changes are applied.

(The tricky bit here is that between the RLock above, and the Lock section in storeRoleDefinitions, some other update could have happened. That's addressed in the second commit of this branch.)

@michaelw michaelw added the fix Bug fix (patch version bump) label Apr 15, 2026
Unmarshal the full merged config in MergeConfiguration and apply that end state directly instead of routing a sparse diff back through the sync apply path.

Keep applyPatch as the helper for real partial section diffs, and factor shared normalization/store helpers so both flows continue to validate and normalize definitions before persisting them.
Snapshot the current config generation before building the merged sync view, normalize the merged role/workflow/provider definition maps off-lock, and only commit them if the generation is unchanged.

Keep the retry logic scoped to MergeConfiguration, compare and commit definitions only, and detach the snapshot through JSON so stale retries do not alias nested state. Reloaded definitions now bump the generation counter, while broader nested-mutation cleanup remains tracked in thand-io#306.
@michaelw michaelw force-pushed the mw/fix-config-sync-apply branch from 6621fa2 to 7026fac Compare April 20, 2026 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix Bug fix (patch version bump)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants