Skip to content

fix: detect and repair disconnected metadata branches at read/write time#533

Open
dvydra wants to merge 19 commits intomainfrom
fix/reconcile-disconnected-metadata-branch
Open

fix: detect and repair disconnected metadata branches at read/write time#533
dvydra wants to merge 19 commits intomainfrom
fix/reconcile-disconnected-metadata-branch

Conversation

@dvydra
Copy link
Contributor

@dvydra dvydra commented Feb 26, 2026

Summary

Detects and repairs disconnected entire/checkpoints/v1 metadata branches — a state caused by the empty-orphan bug (fixed going forward by #511). Users who hit that bug and continued working have local checkpoint data on a completely disconnected history from remote (no common ancestor).

Key distinction: Diverged (shared ancestor, unique commits on each side) is normal multi-dev flow handled by the push path's tree merge. Disconnected (no common ancestor) is the bug state that this PR repairs.

What changed

New: cherry-pick reconciliation (strategy/metadata_reconcile.go)

When local and remote share no common ancestor, local commits are cherry-picked onto the remote tip:

  1. Detect via git merge-base (exit 0 = shared ancestry, exit 1 = disconnected)
  2. Walk local branch tip→root, collect commits oldest-first
  3. Skip empty-tree commits (the orphan bug artifact)
  4. For each commit: compute full delta from parent (adds + modifies + deletes), apply onto current tip
  5. Update local branch ref

Checkpoint shards use unique paths (<id[:2]>/<id[2:]>/), so cherry-picks always apply cleanly. All data is preserved with linear history.

Reconciliation runs at read/write time, not enable time

Runs once per process (via sync.Once) when commands actually access the metadata branch — explain, rewind, resume, clean, condensation. Three call sites:

  • ManualCommitStrategy.getCheckpointStore() — covers strategy-mediated reads/writes
  • getBranchCheckpoints() in explain.go — bypasses strategy, creates own GitStore
  • ListCheckpoints() in strategy/common.go — opens own repo, reads branch directly

This is the right place because enable may run before the user has fetched remote data.

Simplified EnsureMetadataBranch (runs during enable)

Now only handles:

  • Create local from remote when local doesn't exist
  • Replace empty orphan with remote
  • Create empty orphan when neither exists

Removed syncMetadataBranch and isAncestorCLI — disconnected repair is now handled by EnsureMetadataReconciled.

User-facing output

[entire] Detected disconnected session metadata (local and remote share no common ancestor)
[entire] Cherry-picking 3 local checkpoint(s) onto remote...
[entire] Done — all local and remote checkpoints preserved

Files

File Change
strategy/metadata_reconcile.go New — detection + cherry-pick repair
strategy/metadata_reconcile_test.go New — 7 tests covering no-ops, disconnected repair, multi-commit, modifications
strategy/common.go EnsureMetadataReconciled() gate; simplified EnsureMetadataBranch; removed syncMetadataBranch/isAncestorCLI
strategy/manual_commit.go Reconciliation call in getCheckpointStore()
strategy/common_test.go Updated existing tests for new behavior
cli/explain.go Reconciliation call in getBranchCheckpoints()

Test plan

  • mise run fmt && mise run lint && mise run test:ci
  • go test ./cmd/entire/cli/strategy/ -run TestReconcileDisconnected -v — all 7 tests pass
  • Manual: clone repo with remote checkpoints, create disconnected local orphan, run entire explain, verify reconciliation and explain succeeds

🤖 Generated with Claude Code

dvydra and others added 3 commits February 26, 2026 16:35
When enable previously created an empty orphan entire/checkpoints/v1
and the remote later has real checkpoint data, detect the empty local
branch and update it from the remote-tracking branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 15849a699702
When a user hit the empty-orphan bug and then worked normally, the local
entire/checkpoints/v1 has real checkpoint data on a disconnected history
from the remote. EnsureMetadataBranch now detects this using
git merge-base and reconciles by merging both trees (checkpoint shards
are unique, so no conflicts). Also handles the simpler case where local
is behind remote (fast-forward).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: cee58fd2febc
…ite time

Move detection and repair of disconnected local/remote metadata branches
from EnsureMetadataBranch (enable) to a sync.Once-guarded reconciliation
that runs when commands actually access the metadata branch (explain,
rewind, resume, etc.). Uses cherry-pick strategy instead of merge to
preserve linear commit history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 639c85c9977a
Copilot AI review requested due to automatic review settings February 26, 2026 22:50
@dvydra dvydra requested a review from a team as a code owner February 26, 2026 22:50
@cursor
Copy link

cursor bot commented Feb 26, 2026

PR Summary

Medium Risk
Changes git-ref handling and creates new commits to repair metadata history; mistakes could hide or overwrite checkpoint metadata despite added test coverage.

Overview
Fixes corrupted/disconnected checkpoint metadata histories. Adds ReconcileDisconnectedMetadataBranch to detect when local and origin/entire/checkpoints/v1 share no merge-base and repair by cherry-picking local metadata commits onto the remote tip (skipping empty-tree orphan commits) and updating the local ref.

Moves reconciliation to metadata access paths (once per process). EnsureMetadataReconciled (sync.Once gated) is invoked before listing/reading checkpoints in ListCheckpoints, ManualCommitStrategy.getCheckpointStore, and explain’s getBranchCheckpoints, emitting warnings on failure.

Tightens entire enable behavior. EnsureMetadataBranch now only seeds the local branch from origin when missing or when the local branch is an empty orphan, and explicitly avoids reconciling/fast-forwarding non-empty or diverged branches; tests were expanded and new reconciliation tests were added.

Written by Cursor Bugbot for commit 86d7433. Configure here.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR moves “disconnected metadata branch” detection/repair out of entire enable and into a read/write-time reconciliation path, aiming to automatically fix repos affected by the historical empty-orphan metadata branch bug while preserving linear history.

Changes:

  • Adds ReconcileDisconnectedMetadataBranch (+ sync.Once-guarded EnsureMetadataReconciled) to detect disconnected entire/checkpoints/v1 local vs origin/* and repair via cherry-picking.
  • Updates several read/write entrypoints to call reconciliation before accessing checkpoint metadata.
  • Expands strategy tests to cover new metadata-branch behaviors and adds a dedicated reconciliation test suite.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
cmd/entire/cli/strategy/metadata_reconcile.go Implements disconnected-branch detection and cherry-pick-based repair logic.
cmd/entire/cli/strategy/metadata_reconcile_test.go Adds unit tests for reconciliation scenarios (no remote/local, shared ancestry, disconnected, multi-local commits).
cmd/entire/cli/strategy/common.go Adds EnsureMetadataReconciled gate; adjusts EnsureMetadataBranch; calls reconcile before listing checkpoints.
cmd/entire/cli/strategy/common_test.go Adds tests for updated EnsureMetadataBranch semantics and new helper cloneWithConfig.
cmd/entire/cli/strategy/manual_commit.go Ensures reconciliation is invoked before initializing the checkpoint store.
cmd/entire/cli/explain.go Ensures reconciliation is invoked before reading committed checkpoints for explain/rewind points.

dvydra and others added 6 commits February 27, 2026 10:00
Only treat plumbing.ErrReferenceNotFound as "no branch" when checking
local and remote metadata refs. Other errors (e.g. corrupt packfiles)
are now surfaced instead of silently swallowed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 95aa08abfa40
git merge-base uses exit code 1 for "no common ancestor" and 128+ for
actual errors (corrupt repo, invalid hash, timeout). Only treat exit
code 1 as "disconnected"; surface other failures as errors to prevent
spurious cherry-pick reconciliation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: e73dd6177226
…tions

The previous additions-only diff missed modifications (e.g., multi-session
condensation updating metadata.json) and deletions. Now computes the full
diff between each commit and its parent: entries with changed hashes are
applied, and entries removed in the commit are deleted from the tip tree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 562308d98f2b
Previously, if isEmptyMetadataBranch returned an error (e.g., corrupt
object, missing commit), EnsureMetadataBranch silently skipped the
empty orphan check, potentially leaving the repo in a broken state.
Now surfaces the error to callers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: fd02a6f99c6c
Hoist reconcileResult to package level so subsequent callers of
EnsureMetadataReconciled see the same error, not nil.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: fc3459657b3d
@dvydra dvydra requested a review from Copilot February 26, 2026 23:38
@dvydra
Copy link
Contributor Author

dvydra commented Feb 26, 2026

bugbot run

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

…l-flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 5d5f08442ce7
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

dvydra and others added 7 commits February 27, 2026 10:53
Add a test utility to reset the sync.Once reconciliation gate, matching
the existing resetProtectedDirsForTest pattern. Prevents stale cached
results from leaking across tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: c20cb099d45c
…e trees

A corrupted tree object during data repair should abort reconciliation,
not silently drop the commit and lose user data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 49496929adf3
A partial FlattenTree read on the parent would produce an incorrect
delta (treating unread entries as deleted). Propagate all three error
paths: parent commit, parent tree, and FlattenTree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: c2930cbb7765
Prevents silent data loss if the local metadata branch has more than
1000 commits — reconciliation aborts with an error instead of silently
truncating the chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: e59e395232b3
The repo parameter was only used on the first call due to sync.Once,
making the API misleading. Now opens its own repository internally,
matching the ListCheckpoints pattern and simplifying all three callers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 6242e73a4234
Log at debug level when EnsureMetadataBranch detects a non-empty local
branch that differs from remote, making it visible in logs that
reconciliation was deferred to read/write time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: f6f13afacc0a
Move the debug log into an else clause so it only fires when the local
metadata branch has real data that differs from remote. Previously it
fell through after the empty-orphan update case, logging misleadingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 851f7418bdb5
The function was never called — no tests trigger EnsureMetadataReconciled
through the sync.Once gate. Remove it and the comment referencing it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: f5b938b258c8
@dvydra
Copy link
Contributor Author

dvydra commented Feb 27, 2026

Manual test

Tested against a fresh repo with real remote checkpoints and a disconnected local metadata branch.

Setup:

  1. Created test repo, did work with Claude, pushed (pre-push hook pushed entire/checkpoints/v1)
  2. Cloned into /tmp/frog, ran entire enable — this created a disconnected local orphan (empty, no checkpoint data) vs remote's real checkpoint data
  3. Built the fix branch: go build -o /tmp/entire ./cmd/entire

Result:

$ /tmp/entire explain --no-pager
[entire] Detected disconnected session metadata (local and remote share no common ancestor)
[entire] Done — local had no checkpoint data, reset to remote
Branch: main
Checkpoints: 1

[0c66dc95f5c0] "make a test file with a frog ascii in it."
  02-27 12:24 (a2d64e9) Add enterprise-ready frog ASCII art with legal disclaimer

Reconciliation detected the disconnected state, recognized the local branch had no real checkpoint data (just the empty orphan), reset to remote, and explain succeeded showing the remote checkpoint.

@dvydra
Copy link
Contributor Author

dvydra commented Feb 27, 2026

Manual testing (round 2 — full cherry-pick path)

Test 1: Empty orphan reset (frog)

Cloned repo with remote checkpoints. entire enable created a disconnected empty orphan. Fix binary detected and reset to remote:

$ /tmp/entire explain --no-pager
[entire] Detected disconnected session metadata (local and remote share no common ancestor)
[entire] Done — local had no checkpoint data, reset to remote
Branch: main
Checkpoints: 1

[0c66dc95f5c0] "make a test file with a frog ascii in it."
  02-27 12:24 (a2d64e9) Add enterprise-ready frog ASCII art with legal disclaimer

Test 2: Cherry-pick with real local checkpoints (frog4)

Created a properly disconnected empty orphan (simulating the pre-#511 bug), then used Claude + stock entire to create a real commit with a real checkpoint on the disconnected branch. Did NOT push.

Stock entire (before fix) — only sees local checkpoint, no remote data:

$ entire explain
Branch: multi-amphibian-architecture
Checkpoints: 1

[8b68bd0ab277] "double frogs lol"
  02-27 14:26 (c9a79b8) Migrate to multi-amphibian microswamp architecture

Fix binary — detects disconnected state, cherry-picks local onto remote:

$ /tmp/entire explain --no-pager
[entire] Detected disconnected session metadata (local and remote share no common ancestor)
[entire] Cherry-picking 1 local checkpoint(s) onto remote...
[entire] Done — all local and remote checkpoints preserved
Branch: multi-amphibian-architecture
Checkpoints: 1

[8b68bd0ab277] "double frogs lol"
  02-27 14:26 (c9a79b8) Migrate to multi-amphibian microswamp architecture

After reconciliation, switching to main shows all checkpoints:

$ /tmp/entire explain --no-pager
Branch: main
Checkpoints: 2

[e5b34036e8da] "let's review our last pr, with the hate"
  02-27 13:05 (d603ca8) hat

[0c66dc95f5c0] "make a test file with a frog ascii in it."
  02-27 12:24 (a2d64e9) Add enterprise-ready frog ASCII art with legal disclaimer

Both local and remote checkpoint data preserved. Reconciliation runs once, subsequent explains are instant.

refName := plumbing.NewBranchReferenceName(paths.MetadataBranchName)

// Check if remote-tracking branch exists (e.g., after clone/fetch)
remoteRefName := plumbing.NewRemoteReferenceName("origin", paths.MetadataBranchName)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably a problem. origin is a convention. how should we expect to know what the correct upstream is? iterate through all remotes? pick one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this is a common pattern. unlikely a problem for most github users, but it's an assumption we might need to not make in future

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My remote is never called origin 🙃 - it's usually "gh" if it's on GitHub

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so that origin pattern is spread around the cli: https://github.com/search?q=repo%3Aentireio%2Fcli+%5C%22origin%5C%22&type=code

is this a bigger problem?

Copy link
Contributor

@khaong khaong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a well-designed repair mechanism for the disconnected metadata branch bug state. The cherry-pick approach is correct, the distinction between disconnected vs diverged is clear, and test coverage is thorough (7 tests covering no-ops, single/multi-commit cherry-picks, and modifications).

A few items worth discussing inline — the most impactful is the permanent error caching via sync.Once.

Review generated by Claude, checked by Alex.

// branches. Safe to call multiple times — uses sync.Once internally.
// Opens its own repository to avoid capturing a caller's repo instance forever.
func EnsureMetadataReconciled() error {
reconcileOnce.Do(func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Permanent error caching concern: If ReconcileDisconnectedMetadataBranch fails due to a transient issue (e.g., git merge-base times out, repo temporarily locked), the error is cached permanently for the process lifetime via sync.Once. Every subsequent call returns the same stale error.

Since this runs in long-lived agent sessions, a transient failure during explain would prevent reconciliation from succeeding on a later rewind in the same session.

Worth considering whether a retry-on-error pattern would be more robust here — e.g., only cache on success, or use a sync.OnceFunc-style approach that retries on failure. At minimum, documenting this tradeoff in a comment would help.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this one I'm not so sure about, will follow up on Monday)

// - 1: no common ancestor (disconnected)
// - 128+: error (corrupt repo, invalid hash, etc.)
func isDisconnected(repoPath, hashA, hashB string) (bool, error) {
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ReconcileDisconnectedMetadataBranch creates its own context.Background() here, but callers like getBranchCheckpoints already have a ctx. Threading context through would allow cancellation to propagate and would be more idiomatic Go.

if setErr := repo.Storer.SetReference(ref); setErr != nil {
return fmt.Errorf("failed to update metadata branch from remote: %w", setErr)
}
fmt.Fprintf(os.Stderr, "✓ Updated local branch '%s' from origin\n", paths.MetadataBranchName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent output style: This uses fmt.Fprintf(os.Stderr, "✓ Updated local branch...") while ReconcileDisconnectedMetadataBranch uses the [entire] prefix convention (e.g., [entire] Detected disconnected session metadata...). The emoji also doesn't match the rest of the CLI's stderr output. Consider using [entire] prefix here too for consistency.

return nil
}

// isEmptyMetadataBranch returns true if the branch ref points to a commit with an empty tree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This intentionally only checks the tip commit's tree. If someone had an empty orphan commit followed by a data commit, the tip wouldn't be empty and this wouldn't match — which is correct for the intended use case (the bug created a single empty orphan that's the tip). Worth a brief doc comment clarifying this only checks the tip, so future readers don't wonder about the multi-commit case.

// createCherryPickCommit creates a new commit on top of parent, preserving the
// original commit's message and author.
func createCherryPickCommit(repo *git.Repository, treeHash, parent plumbing.Hash, original *object.Commit) (plumbing.Hash, error) {
authorName, authorEmail := GetGitAuthorFromRepo(repo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: authorName/authorEmail are misleading variable names since they're used for the Committer field, not Author (which correctly preserves the original). Consider renaming to committerName/committerEmail for clarity.

}
})

t.Run("updates empty orphan from remote", func(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing t.Parallel() — this subtest uses temp dirs and go-git plumbing so it should be safe to parallelize. Adding it keeps test execution fast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants