Stable & Syncing — a graph of Dark instances that reconcile#5676
Draft
StachuDotNet wants to merge 25 commits into
Package items are stored as an append-only log of ops (package_ops); the package tables (functions, types, values, locations, dependencies, deprecations) are regenerable projections folded from that log. Adds the projection registry and rebuild/refold, so the log is the canonical source of truth and the projections can be dropped and rebuilt at will.
A single extensible socket on ExecutionState turns a runtime conflict (a missing function, a sync divergence, ...) into a policy decision: substitute a value or fail loudly. The default policy fails loudly, byte-identical to before, and it's wired into the interpreter's missing-function path so a policy can later resolve it instead.
Replicate the package_ops log across instances over a file or HTTP/Tailscale transport, applying remote ops through the same idempotent path as local ones. Instances converge by each op's portable authoring time (last-writer-wins; same-millisecond ties broken deterministically by content hash); auto-resolved name-binding divergences are recorded for review and routed through the conflict-dispatch seam. Only committed ops are shared (the commit is the unit of sync), and a mismatched-version peer is paused with a clear upgrade message rather than a decode error. Includes the sync builtins, the sync CLI, and an always-on autosync daemon.
When schema.sql changes, drop only the regenerable projection tables and re-fold the surviving op log, instead of dropping every table. The canonical log, blobs, and branch/commit state come through identical, so a schema change can't lose authored work.
A merge silently lets the merging branch win when both branches bound the same name to different content. Detect those collisions and record each in the same reviewable conflict store sync uses, so concurrent edits across branches converge visibly instead of overwriting without a trace.
A single Release version (the sync wire version) gates cross-instance compatibility and local store upgrades. The migrator is a forward-only registry of steps with a boot guard: a store from a newer Release is refused, an older store is migrated forward, and a fresh store is stamped at the current Release. A step is either a durable migration (copy-and-swap SQL + an optional one-shot op re-serialize + a projection re-fold) or a clean-break boundary that clears the package dataset so it rebuilds from source or re-pulls from a same-Release peer.
Hash the alpha-normalized canonical form of a package item, so bound-variable names (parameters, let/lambda/match binders, and their uses) no longer affect its content hash. Two functions identical up to a parameter rename share one hash, while which argument is used, binder order, and shadowing stay distinct. This keeps an item's identity stable across an op-format or language change, so sync sees no phantom divergences.
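A minimal sketch of what hashing an alpha-normalized form could look like, assuming a toy expression tree of `var` / `lam` / `app` tuples. The PR does not say which normalization scheme it uses; this sketch swaps in De Bruijn indices (each bound variable is replaced by the distance to its binder), and every name in it (`canonical`, `content_hash`, the tuple shapes) is hypothetical. It covers lambda binders only, not let or match.

```python
import hashlib


def canonical(expr, env=()):
    """Rewrite bound variable names to De Bruijn indices; free names are kept."""
    tag = expr[0]
    if tag == "var":
        name = expr[1]
        if name in env:
            # env is innermost-first, so index() finds the nearest binder,
            # which is what makes shadowing resolve correctly.
            return ("bound", env.index(name))
        return ("free", name)
    if tag == "lam":
        _, param, body = expr
        # The parameter name itself is dropped from the canonical form.
        return ("lam", canonical(body, (param,) + env))
    if tag == "app":
        _, fn, arg = expr
        return ("app", canonical(fn, env), canonical(arg, env))
    raise ValueError(f"unknown expression tag: {tag}")


def content_hash(expr):
    """Hash the canonical form, so binder names cannot affect the result."""
    return hashlib.sha256(repr(canonical(expr)).encode("utf-8")).hexdigest()


def add(p, q, first, second):
    """Build `fun p q -> (+ first) second` as a toy expression tree."""
    plus = ("var", "+")
    return ("lam", p, ("lam", q, ("app", ("app", plus, ("var", first)), ("var", second))))


add_xy = add("x", "y", "x", "y")
add_ab = add("a", "b", "a", "b")
add_swapped = add("x", "y", "y", "x")       # arguments used in the other order
add_first_twice = add("x", "y", "x", "x")   # a different argument is used

shadowed = ("lam", "x", ("lam", "x", ("var", "x")))   # refers to the inner x
unshadowed = ("lam", "x", ("lam", "y", ("var", "x")))  # refers to the outer x
```

Here `add_xy` and `add_ab` hash identically, while the swapped, first-twice, and shadowing variants each hash differently, matching the properties the commit lists.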
A follow-up readability pass over the sync code; behavior and the full suite are unchanged.
- Extract `parseLocation` / `formatLocation` (the FQ "owner[.modules].name" inverse pair) and reuse them in `detectDivergences`, `liveBindingHash`, and `resolveConflict` instead of open-coding the split/join at each site.
- Reshape `divergentBindings` to return the structured `PackageLocation` + hashes, so `detectDivergences` just renders the location, removing its unreachable match arm.
- Extract `restampAndRefold`, shared by the automatic keep-local policy (`routeDivergences`) and the human 'mine' override (`resolveConflict`), which re-stamped + re-folded an op identically.
- Flatten `opKindBreakdown`'s nested `List.append`; extract `Display.divergenceNote`, shared by the HTTP and file-pull branches of `dark sync pull` so both word the divergence note identically.
- Note that `pmSyncOpsSince` is committed-only today (so equivalent to `pmSyncOpsSinceCommitted`).
Make always-on sync a first-class `dark apps` daemon and give it structured telemetry.
App + lifecycle:
- Register "Sync" in the apps catalog (Daemon target `Darklang.Sync.Daemon.runManaged`), so
`apps add/start/stop/status/logs sync` and `apps enable sync` all work with no new plumbing.
- Unify the daemon identity on one pidfile ("sync") across the manual `sync daemon …` subcommands
and the apps surface, so both manage the same process.
- New `apps enable sync --boot` enables systemd user-lingering so it starts at boot, not just login.
The poll loop already backs off cleanly when tailscale isn't up yet, so it self-heals on wake
without needing network-ordering in the unit.
Structured telemetry:
- A `sync_daemon_events` table records one row per poll cycle (peers polled, changed, conflicts,
skews), trimmed to the most recent rows; the tailnet loop writes a row each cycle.
- `pmSyncRecordDaemonEvent` / `pmSyncRecentDaemonEvents` expose it, and `sync events` renders recent
cycles as a table (`Display.daemonEventsTable`, pure + testfile-covered) instead of scraping logs.
Fixes and tests on top of the op-log / durable-canon / meaning-stable-hashing foundation, bringing the branch to a mergeable, deployable state:
- Boot: a fresh build crashed in durable-canon because schema.sql's projection tables lacked `description` (it lived only in an incremental). Add it to schema.sql, drop the incremental, guard it with a regression test.
- Portable login: config was written to a dead relative path so login never persisted; derive the config path from the executable dir (beside data.db).
- Conflict overrides now propagate cross-machine: `conflicts resolve mine` used to re-stamp the existing op in place (same commit-rowid, so peers that had already pulled it never re-adopted). Emit a distinct OverrideName op carrying a resolver stamp so it rides the next incremental pull and wins LWW; tested with a binary round-trip and a receiver-side end-to-end test.
- Sync commands return a non-zero exit code on failure (a script wrapping `dark sync pull` can now detect it).
- Single-peer sync daemon records telemetry, so `dark sync events` is populated for `daemon start <peer>`, not only the tailnet-wide loop.
- Don't crash reading an unevaluated package value (rt_dval NULL).
- Schema housekeeping: move the migrator bookkeeping tables and the package_ops composite-PK declaration into schema.sql (single source for all CREATE TABLE).
…licy seam)
Step 1 of the conflict/resolution redesign. De-conflate the two domains that
were sharing one RT dispatch:
- ProgramTypes: add SyncConflict (one case, Divergence of location * candidates),
ResolvedBy (Auto of policy | Human), and DivergenceResolution { chosen; by } —
beside PackageOp, because a sync conflict is a disagreement about the op log and
its resolution is itself an op. Per-kind resolution shape, not a global enum.
- RuntimeTypes: Conflict is now just FnNotFound (runtime-only); Resolution is
Substitute | FailLoudly (dropped the C*/R* prefixes, dropped the sync-divergence
and unused-runtime-error cases). The execution seam stays for missing-fn dispatch.
- Sync: new SyncPolicyChoice (AcceptLww | OverrideTo of Reference), SyncPolicy, and
defaultSyncPolicy = AcceptLww. routeDivergences builds a first-class
PT.SyncConflict.Divergence and consults the sync policy instead of the RT dispatch.
Behavior preserved: default keeps LWW standing; keep-local mints an OverrideName.
- PM/Sync builtins pass defaultSyncPolicy; tests migrate dispatches -> sync policies.
Full backend suite green (9,787 passed). Behavior unchanged.
…time "can't proceed")
Follow-up to the previous commit. The runtime Conflict/Resolution/ConflictDispatch seam had exactly one live consumer (the missing-package-fn site) and its only real behavior was raiseRTE(FnNotFound); the Substitute arm was explicitly unwired. It duplicated RTE.FnNotFound and modeled a "runtime conflict" that doesn't exist yet. So "conflict" is now a sync-only concept:
- RuntimeTypes: delete Conflict, Resolution, ConflictDispatch, and the ExecutionState.conflictDispatch field. Keep CallContext (it's now purely the sync-policy context, assembled from ExecutionState + VMState).
- Execution: drop the default dispatch initializer.
- Interpreter: the missing-package-fn site goes back to raiseRTE(FnNotFound).
- Tests: drop ConflictDispatch.Tests (RT-only seam tests); sync-conflict coverage stays in SyncScenarios.Tests.
A later PR will give the runtime genuine conflict handling: park-and-write-on-demand (PDD-style), then resume. At that point a seam returns, designed against that real requirement. Until then RuntimeError is the model. A breadcrumb at the interpreter site and on CallContext records this.
Full backend suite green (9,785 passed).
Step 2 of the conflict redesign. Tag-byte write/read mirroring PackageOp.fs:
- SyncConflict.Divergence: tag 0, then PackageLocation + List<Reference>.
- ResolvedBy: Auto (tag 0, policy string) | Human (tag 1).
- DivergenceResolution: chosen Reference + ResolvedBy.
Reference reuses the existing PackageOp serializer so a Reference has one wire shape everywhere. Exposed via Serialization.fs (SyncConflict / DivergenceResolution modules) the same way as PackageOp. A round-trip test covers SyncConflict + both ResolvedBy cases.
Full backend suite green (9,786 passed, +1 for the new round-trip test).
…resolution
Step 3 of the conflict redesign. The table no longer stores flat
local_hash/incoming_hash/resolution-prose/acknowledged/overridden; it stores the
structured conflict and its resolution:
(id, kind, location, conflict_blob, chosen_hash, resolved_by,
override_op_id, remote, detected_at, status)
- conflict_blob: a serialized PT.SyncConflict (the candidates).
- chosen_hash + resolved_by: the resolution ('auto:last-writer-wins' | 'human' | ...).
- override_op_id: the OverrideName op a deliberate override mints (NULL until step 5).
- status: review lifecycle ('auto-resolved' | 'acknowledged' | 'overridden').
Conflicts.fs rewritten: record() builds + serializes the Divergence and dedups on
the exact blob; list()/getById() deserialize it. localHash/incomingHash/acknowledged/
overridden are now derived members over the structured record, so the builtins +
tests that read them stay unchanged. recordDivergences + Merge pass chosen_hash +
resolved_by. A temporary `resolution` prose bridge keeps the .dark display green until
the next step restructures it.
Local/disposable table → schema.sql change only (no Release step; the test DB is
disk-mode, rebuilt fresh). Full backend suite green (9,786).
…, no prose parsing
Step 4 of the conflict redesign. The `dark conflicts` display now renders the
structured resolution instead of parsing prose:
- pmConflictsList returns (id, location, status, chosenHash, resolvedBy, localHash,
incomingHash, remote) — chosen hash + the policy that picked it.
- display.dark: conflictWinner reads the winner STRUCTURALLY (chosen == local/incoming)
instead of String.contains on resolution prose; conflictVerdict labels by resolvedBy
('last-write-wins' / 'override' / 'merge'). The conflicts-list testfile rewritten to
the structured signatures.
- Conflicts.fs: the temporary `resolution` prose bridge is gone; the two F# tests that
read it now assert resolvedBy directly.
Full backend suite green (9,788).
… 5a-i)
Per the design correction, an override is not a new op; it's a synced Resolution overlaid on the op-fold. This adds that mechanism (additive; OverrideName still in place, removed in a later step):
- schema: a `resolutions` table (id, location, item_type, chosen_hash, resolved_by, branch_id, at): synced decisions that override the op-fold for a contested name.
- LibDB.Resolutions: mk/record/applyToLocations/recordAndApply/list. applyToLocations re-binds the location to the chosen content, gated by the SAME timestamp-LWW applySetName uses (a resolution whose `at` is older than the current binding is skipped; an exact tie breaks by the higher hash, portably), so it converges.
The effective binding becomes: fold(ops) [LWW] → then apply resolutions per location [last-resolver-wins by `at`]. A resolution's fresh `at` is what lets a "keep mine" decision win where re-emitting the original SetName (same content hash → same op id) could not, which is the reason OverrideName existed. A test covers the overlay (newer resolution overrides; a stale one is skipped).
Full backend suite green (9,789).
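The overlay in this commit (fold the ops with LWW, then apply resolutions per location under the same timestamp gate) can be sketched roughly as follows. This is an illustration under stated assumptions, not the F# implementation: the tuple shapes, `effective_bindings`, and the integer stamps are hypothetical, and only the field names `origin_ts`, `chosen_hash`, `resolved_by`, and `at` come from the PR text.

```python
from typing import NamedTuple


class SetName(NamedTuple):
    location: str
    content_hash: str
    origin_ts: int  # portable authoring time of the op


class Resolution(NamedTuple):
    location: str
    chosen_hash: str
    at: int          # fresh stamp taken when the decision was made
    resolved_by: str  # e.g. "human" or "auto:keep-local"


def wins(candidate, current):
    """(stamp, hash) ordering: newer stamp wins, an exact tie goes to the higher hash."""
    return current is None or candidate > current


def effective_bindings(ops, resolutions):
    bound = {}  # location -> (stamp, hash)
    # Step 1: fold the op log with last-writer-wins.
    for op in ops:
        candidate = (op.origin_ts, op.content_hash)
        if wins(candidate, bound.get(op.location)):
            bound[op.location] = candidate
    # Step 2: overlay resolutions, gated by the same rule, so a stale one is skipped.
    for res in resolutions:
        candidate = (res.at, res.chosen_hash)
        if wins(candidate, bound.get(res.location)):
            bound[res.location] = candidate
    return {location: content for location, (_, content) in bound.items()}


race = [SetName("Demo.add", "mine", 100), SetName("Demo.add", "theirs", 200)]
keep_mine = Resolution("Demo.add", "mine", at=300, resolved_by="human")
later_edit = SetName("Demo.add", "v3", 400)

by_lww = effective_bindings(race, [])
overridden = effective_bindings(race, [keep_mine])
superseded = effective_bindings(race + [later_edit], [keep_mine])
```

With no resolution the later op ("theirs") stands; the fresh `at` lets "mine" win even though its original op is older; and a genuinely newer authored op still supersedes the resolution, so the overlay is not a permanent pin.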
The resolution overlay now has a sync channel, mirroring the op channel so a synced decision propagates cross-machine without a new op:
- sync_cursors gains resolutions_through_rowid; SyncCursors gets resolutionCursorFor/advanceResolutionCursor (a separate per-peer cursor).
- Resolutions: a shared row-reader (ofRow), `since cursor` (the sender read), and applyToLocations is now idempotent (skips when already bound to chosen, so a re-pulled resolution doesn't churn locations).
- Sync: encodeResolutions/decodeResolutions (version-guarded, mirroring encodeBatch) + applyRemoteResolutions (record + fold each + advance the cursor).
A test ships a resolution over the wire (encode→decode) and a peer adopts it, idempotently. Additive: overrides still mint OverrideName for now; the switch + pull-path integration follow.
Full backend suite green (9,790).
pullFromFile now also pulls the resolution channel: pullResolutionsFromStore reads the peer's `resolutions` table above our resolution-cursor, records + folds each into locations (the overlay), and advances the cursor — alongside the op + blob pulls. So `dark sync pull <file>` propagates override decisions cross-machine. Tolerant of a peer with no `resolutions` table (older store / minimal test db → nothing to pull). A test builds a peer db whose only content is one resolution and asserts pullFromFile applies it to this instance's binding (the op/blob channels empty, so it isolates the resolution pull). Full backend suite green (9,791; a flaky cross-list global-connection contention cleared on re-run — unrelated to this change).
…(step 5a-ii)
overrideBinding (the keep-local policy + the human 'mine' override) now records a
Resolution and applies it via the overlay, instead of minting an OverrideName op:
- Sync.overrideBinding: writes Resolutions.recordAndApply with a fresh `at` and the
resolver tag ('auto:keep-local' or 'human'). No op; the decision rides the resolution
channel and wins timestamp-LWW.
- Seed.rebuildProjections: after the op-fold, Resolutions.applyAll re-applies every
resolution over the rebuilt locations — so overrides survive a projection refold (the
op log alone doesn't carry them). resolutions isn't a projection table, so a rebuild
doesn't clear it.
- The keep-local regression test now asserts NO op is appended + a resolution choosing
our hash is recorded. CLI copy updated: a resolution syncs on the next pull (no commit).
OverrideName is now unused (removed next). Full backend suite green (9,791).
Nothing emits OverrideName anymore (overrides are synced Resolutions), so remove it. A clean F#-only deletion: it already surfaced to the Dark side AS SetName (so the Dark PackageOp type is unchanged, no hash ripple), and under the clean break no stored op is an OverrideName:
- ProgramTypes: drop the PackageOp case.
- Binary serializer: drop tag 8 (write + read). Other ops' tags/hashes unchanged.
- ProgramTypesToDarkTypes / PackageOpPlayback / PackageManager / Seed: drop the arms (each folded/mapped exactly like SetName).
- SyncScenarios.Tests: remove overrideOpRoundTrips + overridePropagatesToPeer (they constructed OverrideName ops); the resolution overlay/wire/file-pull tests cover the replacement end to end.
The op log is now purely authored content/structure; overrides live in the resolution overlay.
Full backend suite green (9,789).
The override channel now rides HTTP too, not just the file pull, so resolutions propagate over the tailnet path the autosync daemon uses. Mirrors the op channel:
- builtins: pmSyncResolutionsSince (server read), pmSyncResolutionCursorFor (the client's separate resolution cursor), pmSyncApplyResolutions (client decode + fold).
- server.dark: a GET /sync/resolutions?since=<cursor> route.
- api.dark: resolutionsUrl + pullResolutions; pullHttp now pulls ops → blobs → resolutions.
Each new builtin has exactly one .dark call site. Resolutions sync immediately (a decision is published when made, not gated by commit).
Full backend suite green (9,789).
Two scenarios the overlay model needs pinned:
- a genuinely NEWER authored op supersedes an older resolution (the overlay isn't a permanent pin; convergence is still timestamp-LWW across ops AND resolutions).
- Resolutions.applyAll re-applies a recorded override after its binding is cleared (the refold-safety integration: rebuildProjections re-applies resolutions over the rebuilt op-fold). Exercised at single-location grain to avoid a global rebuild.
Full backend suite green (9,791).
…arsing
Tightening pass (no behavior change):
- Tree-shake the dead DivergenceResolution binary serializer. A resolution is stored flattened to columns (chosen_hash + resolved_by), never serialized, so the serializer (Serialization.fs module + SyncConflict.fs's ResolvedBy/DivergenceResolution write/read) and its round-trip test were used only by the test. Removed; SyncConflict's serializer stays (it backs conflict_blob). The DivergenceResolution/ResolvedBy *types* are kept as the model vocabulary (now orphan; flagged for a possible follow-up).
- Tree-shake the unused override_op_id column + Conflict.overrideOpId field: always NULL, nothing wrote or read it.
- Consolidate location FQN parse/format: Sync.formatLocation/parseLocation + Conflicts.parseLoc + the inline modules-split (3 copies) now route through PackageLocation.toFQN/fromFQN/modulesOfString (one home).
Full backend suite green (9,791).
The subtle convergence rule (older-by-stamp loses; an exact same-stamp tie breaks by the higher content hash, portably) was duplicated verbatim in the op fold (applySetName) and the resolution overlay (applyToLocations) — two copies of the logic that MUST agree for instances to converge. Extracted to PackageLocation.bindingIsStale, used by both, with a direct unit test pinning the contract. No behavior change. Full backend suite green (9,792).
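The shared rule (older-by-stamp loses; an exact same-stamp tie breaks by the higher content hash) is small enough to sketch. The predicate below is a hypothetical Python stand-in for `PackageLocation.bindingIsStale`; its name, argument order, and the `Binding` shape are assumptions, since the PR text does not show the real signature.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Binding(NamedTuple):
    stamp_ms: int      # portable authoring time, in milliseconds
    content_hash: str


def incoming_is_stale(current: Optional[Binding], incoming: Binding) -> bool:
    """True when `incoming` must NOT replace `current`."""
    if current is None:
        return False
    if incoming.stamp_ms != current.stamp_ms:
        return incoming.stamp_ms < current.stamp_ms
    # Same millisecond: the higher content hash wins. An equal hash means the
    # location is already bound to this content, so nothing needs to change.
    return incoming.content_hash <= current.content_hash


def apply_in_order(bindings):
    """Apply candidate bindings one by one, as they happen to arrive."""
    current = None
    for incoming in bindings:
        if not incoming_is_stale(current, incoming):
            current = incoming
    return current


race = [Binding(1000, "aa11"), Binding(1000, "ff00"), Binding(999, "zz99")]
winners = {apply_in_order(order) for order in permutations(race)}
```

Every arrival order yields the same winner, `Binding(1000, "ff00")`, which is the property both the op fold and the resolution overlay rely on to converge.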
Their only consumer (the binary serializer) was removed in the prior tightening commit,
leaving them fully orphan — the code represents a resolution as chosen_hash + resolved_by
columns ('auto:<policy>' | 'human'), never the typed value. Remove them. SyncConflict
(the one serialized, constructed type) stays; its doc now describes the resolution as the
flattened decision it actually is, and drops the stale "its resolution is itself an op"
line (it's a synced overlay, not an op).
Full backend suite green (9,792).
(description is currently AI-gen'd and needs refinement. will do before requesting review)
The goal
Use Dark across all my devices. Each device runs a Dark instance; together they form a graph on my
tailnet that reconciles its package changes. Author a function on the laptop, it shows up on the desktop.
The instances agree on the same code with no central server in the loop.
Two properties make it safe to live in:
auto-resolves and records the race so I can follow up.
graph converges without a coordinator — and a re-pull or full replay reproduces it.
`Resolution` overlaid on the op-fold, so peers adopt your choice on their next pull, without inventing a new op.
Foundation 1 — ops ⊥ projections
The op log is the source of truth; everything you see is a projection of it. Every package change (add a
fn, rename, deprecate, propagate) is an op in a branch-scoped, append-only log. The tables you query
(functions/types/values, locations, dependencies) are projections folded from that log;
`package_blobs` is canonical content, not a projection. `rebuildProjections` clears the projections, marks ops unapplied, and re-folds the whole log → byte-identical tables. Surfaced as
`dark status` (ops in the log vs folded-through: is the cache current?) and
`dark branch rebuild`.
This is what makes sync safe: replicating the op log and re-folding it is the same operation as a local
edit. Losing a projection costs only CPU; the ops are what matter.
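A toy model of the ops-versus-projections split described above, assuming two made-up op kinds and plain dicts standing in for the projection tables. None of these names are the real schema; the point is only that a projection is a pure function of the log, so dropping and re-folding it reproduces the same tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str     # "add_fn" or "set_name" (hypothetical op kinds)
    name: str     # the location the op touches
    content: str  # content hash the op refers to


def fold(log):
    """Fold the append-only op log into projection tables (dicts here)."""
    functions, locations = {}, {}
    for op in log:
        if op.kind == "add_fn":
            functions[op.content] = op.name
        elif op.kind == "set_name":
            locations[op.name] = op.content
        else:
            # No wildcard: an unknown kind is an error, never silently skipped.
            raise ValueError(f"unknown op kind: {op.kind}")
    return {"functions": functions, "locations": locations}


def rebuild_projections(log):
    """Discard whatever projections exist and re-fold the whole log."""
    return fold(log)


log = [
    Op("add_fn", "Demo.add", "h1"),
    Op("set_name", "Demo.add", "h1"),
    Op("add_fn", "Demo.add", "h2"),
    Op("set_name", "Demo.add", "h2"),
]
projections = fold(log)
rebuilt = rebuild_projections(log)
```

`projections` and `rebuilt` compare equal, and `Demo.add` ends up bound to `h2`, the content named by the last `set_name` in the log.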
Foundation 2 — conflicts are local; only resolutions sync
A sync conflict is a disagreement about the op log — today one kind: a name bound to two contents across
instances (a `SetName` race). Modeled as
`SyncConflict` (a DU beside `PackageOp`, one case `Divergence of location * candidates`). Conflicts are detected locally during the fold and are re-derivable: every instance reaches the same conflicts deterministically, so they never sync. A resolution (which candidate is
chosen, and by whom: `auto:<policy>` or `human`, stored as a chosen hash + a resolved-by tag) is the only thing besides ops that travels.
Sync
A receiver pulls the ops it hasn't seen (per-peer cursor) and folds them through the same playback path a
local edit uses. Apply is idempotent (`INSERT OR IGNORE` by content-hash id), so re-pulling is a no-op and a full replay reproduces the identical projection. Ops fold onto the branch they belong to.
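A rough sketch of the cursor-plus-idempotent-apply idea, using in-memory SQLite stores. The two-column `package_ops` table, the `sync_cursors` layout, and the helper names are simplifications invented for this example (the real table has a composite primary key and more columns); only `INSERT OR IGNORE` keyed by a content-hash id comes from the description.

```python
import hashlib
import sqlite3


def op_id(blob: bytes) -> str:
    """Content-addressed id: the same bytes always produce the same id."""
    return hashlib.sha256(blob).hexdigest()


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE package_ops (id TEXT PRIMARY KEY, blob BLOB NOT NULL)")
    db.execute("CREATE TABLE sync_cursors (peer TEXT PRIMARY KEY, through_rowid INTEGER NOT NULL)")
    return db


def append_op(db: sqlite3.Connection, blob: bytes) -> None:
    # Re-applying an op already in the log is ignored instead of failing.
    db.execute("INSERT OR IGNORE INTO package_ops (id, blob) VALUES (?, ?)", (op_id(blob), blob))


def ops_since(db: sqlite3.Connection, cursor: int):
    return db.execute(
        "SELECT rowid, blob FROM package_ops WHERE rowid > ? ORDER BY rowid", (cursor,)
    ).fetchall()


def pull(receiver: sqlite3.Connection, sender: sqlite3.Connection, peer: str) -> int:
    """Pull only the ops above our cursor for this peer; return how many arrived."""
    row = receiver.execute(
        "SELECT through_rowid FROM sync_cursors WHERE peer = ?", (peer,)
    ).fetchone()
    cursor = row[0] if row else 0
    batch = ops_since(sender, cursor)
    for rowid, blob in batch:
        append_op(receiver, blob)
        cursor = rowid
    receiver.execute(
        "INSERT OR REPLACE INTO sync_cursors (peer, through_rowid) VALUES (?, ?)", (peer, cursor)
    )
    return len(batch)


laptop, desktop = open_store(), open_store()
append_op(laptop, b"AddFn h1")
append_op(laptop, b"SetName Demo.add -> h1")

first_pull = pull(desktop, laptop, "laptop")
second_pull = pull(desktop, laptop, "laptop")
for _, blob in ops_since(laptop, 0):
    append_op(desktop, blob)  # a full replay of the sender's log changes nothing
op_count = desktop.execute("SELECT COUNT(*) FROM package_ops").fetchone()[0]
```

The first pull transfers two ops, the second transfers none because the cursor has advanced, and replaying the whole log leaves the receiver with the same two rows.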
Conflicts & resolutions. A `SetName` race auto-resolves by choosing whichever op was written later:
each op carries a portable `origin_ts` authoring time, kept beside the op so its content hash is unchanged; every instance computes the same winner regardless of arrival order, and an exact tie breaks deterministically
by content hash. The race is recorded, never lost:
`ack` it, or `resolve <id> mine|theirs` to override.
The conflict report is a pure Dark package (`Sync.Display.conflictReport`) over structured rows from the builtin, so the UX is package-testable and iterable without an F# rebuild.
How an override propagates (the subtle part). Sync is incremental: a peer only pulls above its cursor.
So an override can't just re-emit the original `SetName`: that op is content-addressed, so re-emitting produces the same op id and a peer that already pulled it sees no change. The fix is to not make it an op
at all. An override is a `Resolution`: a separate synced decision overlaid on the op-fold, carrying a fresh stamp; the effective binding is `fold(ops)` then apply resolutions per location (last-resolver wins). It rides its own channel (a `resolutions` table + cursor + wire codec, on both transports) and syncs immediately (published when made, not gated by commit). So a re-pulling peer adopts it and re-binds. (This replaces the earlier `OverrideName` op, which existed only to manufacture a distinct hash; removed.)
Transport. Two carriers; the reconciler doesn't care which:
`dark sync pull <peer's data.db>`. Direct, offline, no server.
`/sync/{events,blobs,resolutions,health}`; the client pulls the delta since its cursor. Trust model: machines on the tailnet are trusted (identity = the `Tailscale-User-Login` header, the tailnet is the boundary; `httpClientGetUnsafe` relaxes SSRF only for that). MagicDNS + TLS for free. (A public hub with bearer auth is designed-for, out of scope here.)
Every op kind rides sync. `applyOp` has no wildcard, so the compiler forces every `PackageOp` kind to be folded on the receiver; `opsSince` ships every kind; the wire frames the raw blob byte-exact. A propagation rides as its companion `SetName` ops, so dependents repoint too.
Stable — your work survives upgrades
Syncing is only safe to live in if your data survives the system changing under you.
(fine for a seeded dev box, fatal for real work). Now it drops only the regenerable projections and
re-folds the surviving log; the canonical op log / blobs / branch+commit state come through identical.
The line between "a neat demo" and "I can keep my life here."
`/sync/health` (`release=N; ops=M`): the single version that gates cross-instance sync.
never corrupts);
`dark sync check` says it plainly (" is on Release N, you're on M — upgrade") instead of a raw decode error.
`LibDB/Releases.fs`). One coordinate gates upgrades and cross-instance sync, with a boot guard: a newer-Release store is refused, an older one migrated forward, a fresh one stamped current. A
step is either data-preserving (copy-and-swap SQL + optional op re-serialize + a projection re-fold) or a
clean-break boundary that rebuilds the package dataset from source / a same-Release peer. This is the seam
we grow into "port your state from alpha1 → alpha2."
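The boot guard and forward-only step registry above can be sketched as a small function. Everything here is hypothetical (the `boot_guard` name, the `STEPS` table, the release numbers); it is a sketch of the three outcomes the description names, not the code in `LibDB/Releases.fs`.

```python
class StoreTooNew(Exception):
    """The store was written by a newer Release than this binary understands."""


def boot_guard(store_release, current_release, steps):
    """Return (release after boot, steps applied as (from, to, kind) tuples).

    `steps` maps a Release N to the step that moves a store from N to N + 1.
    """
    if store_release is None:
        # Fresh store: stamp it at the current Release, nothing to migrate.
        return current_release, []
    if store_release > current_release:
        # Forward-only: never try to read a store from the future.
        raise StoreTooNew(
            f"store is on Release {store_release}, this binary is on {current_release}"
        )
    applied = []
    release = store_release
    while release < current_release:
        kind, _description = steps[release]
        applied.append((release, release + 1, kind))
        release += 1
    return release, applied


# Hypothetical registry: one data-preserving step, one clean-break boundary.
STEPS = {
    1: ("durable", "copy-and-swap SQL, optional op re-serialize, projection re-fold"),
    2: ("clean-break", "clear the package dataset; rebuild from source or a same-Release peer"),
}

fresh = boot_guard(None, 3, STEPS)
migrated = boot_guard(1, 3, STEPS)
up_to_date = boot_guard(3, 3, STEPS)
```

A fresh store is stamped at 3 with no steps, a Release 1 store walks both steps in order, and `boot_guard(4, 3, STEPS)` raises `StoreTooNew` rather than attempting a decode.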
Also landed:
names don't affect identity: `fn add x y` and `fn add a b` are the same item; argument use, binder order, and shadowing stay distinct. Identity is stable across a rename or an op-format change, so sync sees no phantom divergences.
`dark apps` entry (`apps start sync`, `apps enable sync` for start-at-boot), observable and lifecycle-managed like any other app, with structured per-cycle telemetry via `dark sync events`.
What's coming (the foundation is built to grow)
`locations` machinery as `SetName`. New conflict `CMoveCollision`.
the stuff you keep across devices). New conflict `CValueUpdateRace`.
(`CCapabilityDenied`), routed through the same spine. A new kind is ~four touches, no migration.
Status
auto-resolve and overrides propagate.