feat(sync): local-first multi-machine artifact sync #731
Feedback is welcome. Still in draft mode since work and testing to this point have been completely agent-driven, using a combination of GPT-5.5 and Claude 4.8. Next up is manually trying various distributed-machine scenarios and seeing how well any of this works in practice. Assuming the idea eventually proves out, I'm happy to split it into smaller, more manageable PRs.
roborev: Combined Review (
## Summary
- Keep the candidate-window and boundary-session behavior in `internal/postgres/push.go` unchanged for this PR, and batch the PostgreSQL-side comparison reads used to decide whether a candidate session can be skipped.
- Implement new batched loaders in `internal/postgres/push_fingerprint.go` for message aggregates, message content hashes, role/time fingerprints, message flags, message system ordinals, token fingerprints, tool-call aggregates, tool-call fingerprints, and usage fingerprints, with chunking inside the helper when session counts exceed `ANY($1)` practicality.
- Use the preloaded message and tool-call aggregates on the hot no-op path, and retry any comparison-preload SQL failure in a fresh transaction without the batched preload instead of continuing inside an already-aborted transaction.
- Add targeted regression tests in `internal/postgres/push_test.go` and `internal/postgres/push_fingerprint_test.go` to cover the new batch-driven skip decision path and helper behavior with empty inputs.

## Scope
- Files changed are `internal/postgres/push.go`, `internal/postgres/push_fingerprint.go`, `internal/postgres/push_test.go`, and `internal/postgres/push_fingerprint_test.go`.
- No boundary/windowing semantics, no schema changes, and no changes to PR #731 or broader sync-work areas.

## Notes
- A focused PG comparison query-count assertion was not added because the existing harness does not expose a stable helper-call/query metric for this exact path without adding brittle test-only instrumentation.
- The review-driven follow-up keeps the existing non-batched fingerprint fallback, but now that fallback only runs from a clean transaction after preload failure instead of on the poisoned transaction that raised the preload error.

Fixes #331

Co-authored-by: Rod Boev <rodboev@users.noreply.github.com>
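The chunked preload described in the summary could look roughly like the sketch below. This is not the code in `internal/postgres/push_fingerprint.go`: `chunkIDs`, `loadBatched`, the chunk size, and the `map[string]int` result shape are all hypothetical, and the `load` callback stands in for a single `... = ANY($1)` query. The fresh-transaction retry is not shown.

```go
package main

import "fmt"

// chunkIDs splits ids into consecutive slices of at most size elements.
// The chunk size is a hypothetical tuning knob, not a value from the PR.
func chunkIDs(ids []string, size int) [][]string {
	if size <= 0 || len(ids) == 0 {
		return nil
	}
	chunks := make([][]string, 0, (len(ids)+size-1)/size)
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}

// loadBatched calls load once per chunk and merges the per-session results.
// load is an assumed stand-in for one query using `session_id = ANY($1)`.
func loadBatched(ids []string, size int, load func([]string) (map[string]int, error)) (map[string]int, error) {
	out := make(map[string]int, len(ids))
	for _, chunk := range chunkIDs(ids, size) {
		part, err := load(chunk)
		if err != nil {
			// The caller decides whether to fall back to per-session reads.
			return nil, fmt.Errorf("batched preload: %w", err)
		}
		for id, v := range part {
			out[id] = v
		}
	}
	return out, nil
}

func main() {
	ids := []string{"a", "b", "c", "d", "e"}
	calls := 0
	counts, err := loadBatched(ids, 2, func(chunk []string) (map[string]int, error) {
		calls++ // one simulated round trip per chunk
		m := make(map[string]int, len(chunk))
		for _, id := range chunk {
			m[id] = len(id)
		}
		return m, nil
	})
	fmt.Println(calls, len(counts), err)
}
```

Keeping the chunking inside the helper means the caller passes the full candidate list and gets one merged map back, so an empty input simply produces zero queries.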
Thanks for the review. Both findings were valid and are addressed in 2804870 and f252c35. High — Windows-invalid Note this changes the canonical on-disk HLC string ( Medium — divergent origin sources. Confirmed:
Claude Opus 4.8 reasoning-medium on behalf of maphew |
Thanks again. All three findings were valid and are addressed in 055f3b3. High — local metadata events missing from the replay register ( Medium — remote HLCs not observed by the local clock ( Medium — one unavailable target aborted the rest of the origin (
Claude Opus 4.8 reasoning-medium on behalf of maphew |
Thanks. Both convergence gaps were valid and are fixed in 1d8d24c and 8cac9ff. Medium — usage-only sessions never exported ( Medium — bulk star emitted no metadata events (
Claude Opus 4.8 reasoning-medium on behalf of maphew |
Thanks. Addressed in e77db3a, 4110dce, acbb789, and 6ea6fa8. Medium — Medium — unconditional S3 PUT violates write-once ( Medium — Medium — remote events applied before the HLC advances (
Claude Opus 4.8 reasoning-medium on behalf of maphew |
Force-pushed from 6ea6fa8 to 18c0f18.
I will rebase this.
Force-pushed from 18c0f18 to b228d18.
I'll continue to work on this a bit to see if I can get it into a state that I'm comfortable with.
Force-pushed from b228d18 to 16e5a7b.
Force-pushed from efb934f to 550372b.
Force-pushed from eaf6694 to f944578.
@wesm thanks for taking a direct interest! It's appreciated. I'm largely away for this week plus a bit. I'm looking forward to exercising this firsthand when I'm back :)
Force-pushed from 1cd7126 to ebf0e73.
I'll work on getting this landed after 0.35.0 is released; a lot has already been merged to main since the last release.
Force-pushed from d64f853 to 1673d5a.
Squashed follow-up changes:
- fix(sync): harden artifact metadata convergence
- fix(postgres): migrate pinned source index after column
- fix(server): reject unsupported artifact uploads before write
- fix(sync): export baseline curation metadata
- fix(sync): observe peer metadata before baseline init
- fix(sync): snapshot baseline before peer import
- fix(sync): preserve metadata replay across imports
- fix(server): gate artifact routes to local stores
- fix(sync): gate metadata ledger behind artifact sync opt-in
- refactor(sync): pin manifest wire DTO and clean replay internals

Co-authored-by: Wes McKinney <wesmckinn+git@gmail.com>
Force-pushed from bd82f22 to b2c8e68.
Collision-resolved PostgreSQL pushes need every session-scoped side table and tombstone path to use the same resolved PG ID as the session upsert. Otherwise a collision can leave aliases or exclusions attached to the bare native ID, crossing ownership boundaries between machines.

Artifact sync initialization also has to publish pre-existing trash state for sessions peers already know about. Without a soft-delete baseline, enabling artifact sync after local curation could leave imported peers with restored-looking rows even though the source had already moved them to trash.

The frontend peer/status surfaces now use kit-ui primitives and tokens so the shared design-system gate can run after the PR rebase.
CopySyncStateFrom now carries the artifact ledger keys (origin identity, metadata HLC, import/export watermarks) alongside the PG push marker, and ResyncAll copies the metadata_applied_events, metadata_replay_state, and metadata_conflicts tables into the rebuilt DB before the swap. Losing this state let already-applied peer events replay against an empty LWW register and could regenerate a DB-only origin.
Normalize file_path, file_size, file_mtime, file_inode, file_device, and file_hash out of the hashed session manifest. Import already discards these fields as local state, so a touch, move, or re-download of a source file re-hashed the manifest and forced peers to re-import unchanged content, clearing importer-local secret scan state.
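A minimal sketch of the normalization described in that commit message, under stated assumptions: `sessionManifest` is a hypothetical, trimmed-down struct rather than the PR's pinned wire format, and SHA-256 over the JSON encoding is assumed as the hash. The point it illustrates is that machine-local file fields are cleared before hashing, so touching or moving the source file leaves the manifest hash unchanged.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// sessionManifest is a hypothetical, trimmed-down manifest. The real wire
// format has more fields and is pinned by golden tests.
type sessionManifest struct {
	SessionID    string `json:"session_id"`
	MessageCount int    `json:"message_count"`
	FilePath     string `json:"file_path,omitempty"`
	FileSize     int64  `json:"file_size,omitempty"`
	FileMtime    int64  `json:"file_mtime,omitempty"`
	FileInode    uint64 `json:"file_inode,omitempty"`
	FileDevice   uint64 `json:"file_device,omitempty"`
	FileHash     string `json:"file_hash,omitempty"`
}

// manifestHash clears the machine-local file fields on its by-value copy,
// then hashes the JSON encoding of what remains.
func manifestHash(m sessionManifest) (string, error) {
	m.FilePath, m.FileHash = "", ""
	m.FileSize, m.FileMtime = 0, 0
	m.FileInode, m.FileDevice = 0, 0
	raw, err := json.Marshal(m)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	before := sessionManifest{SessionID: "s1", MessageCount: 3, FilePath: "/old/s1.jsonl", FileMtime: 100}
	after := sessionManifest{SessionID: "s1", MessageCount: 3, FilePath: "/new/s1.jsonl", FileMtime: 200}
	h1, _ := manifestHash(before)
	h2, _ := manifestHash(after)
	fmt.Println("hash stable across move/touch:", h1 == h2)
}
```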
Watch mode silently ignored --init: every push hardcoded BaselineMetadata=false, so first-run curation baseline metadata was never published. The pusher now carries the flag and publishes the baseline on the first successful exchange, keeping it pending across failed pushes so a flaky initial exchange still baselines on retry (AppendBaselineSnapshot skips already-covered fields, so the retry is idempotent).

Separately, an incoming peer exchange on an uninitialized server mints an origin only in DB sync state. A later CLI sync went through EnsureArtifactOriginID, which consults only the config and would generate a second origin that serve then adopts as authoritative, overwriting the DB origin and stranding metadata events published under it. CLI sync now resolves the origin via resolveArtifactOrigin: a config origin wins, otherwise a stored DB origin is promoted into config via the new AdoptArtifactOriginID, and only when neither exists is a new origin generated.
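The origin precedence described above can be sketched as follows. This is an illustration, not the PR's `resolveArtifactOrigin`: `originStore`, `resolveOrigin`, and the `host-<hex>` id shape are hypothetical, and the two string fields stand in for the config file and the DB sync state.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// originStore is a hypothetical stand-in for the config file and the DB
// sync-state table. The PR's real types are not shown in this thread.
type originStore struct {
	configOrigin string
	dbOrigin     string
}

// resolveOrigin applies the precedence from the commit message: a config
// origin wins, a DB-only origin is promoted into config, and a new origin
// is generated only when neither exists.
func resolveOrigin(s *originStore, host string) (string, error) {
	if s.configOrigin != "" {
		return s.configOrigin, nil
	}
	if s.dbOrigin != "" {
		// Promote rather than mint a second, competing origin.
		s.configOrigin = s.dbOrigin
		return s.configOrigin, nil
	}
	suffix := make([]byte, 4)
	if _, err := rand.Read(suffix); err != nil {
		return "", err
	}
	// The id format here is an assumption made for the example.
	s.configOrigin = host + "-" + hex.EncodeToString(suffix)
	return s.configOrigin, nil
}

func main() {
	// A peer exchange minted an origin in the DB before any CLI sync ran.
	s := &originStore{dbOrigin: "nas-1a2b3c4d"}
	origin, err := resolveOrigin(s, "laptop")
	fmt.Println(origin, s.configOrigin == s.dbOrigin, err)
}
```

Resolving in this order keeps one origin per machine, so metadata events already published under the DB origin stay attributable after the first CLI sync.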
Implements the local-first multi-machine sync design proposed in #692:
every machine keeps the complete archive and machines converge by
exchanging immutable, content-addressed artifacts over any dumb transport
instead of depending on an always-on PostgreSQL hub. SQLite stays a local,
rebuildable derivation — the live database file never crosses the wire.
Design rationale and the full set of alternatives considered (Automerge,
cr-sqlite, the SQLite session extension, whole-DB replication, raw-file
mirroring) are discussed in #692; user-facing setup is in
`docs/artifact-sync.md`.

What this adds

- write-once, content-addressed store under
  `$AGENTSVIEW_DATA_DIR/artifacts/<origin>/`: append-only checkpoints,
  session manifests, zstd-compressed NDJSON message segments, a metadata
  change feed, and an optional raw-source fallback. Serialization is a
  pinned forever-contract enforced by golden tests; readers ignore unknown
  fields and skip unknown future ops so mixed app versions keep syncing.
  HLC timestamps render without `:` so metadata filenames are valid on
  Windows.
- name plus a random suffix). Foreign sessions are stored as
  `origin~nativeID` with `machine=origin`, the same convention SSH
  remote-sync already uses, so every read path, the UI, and analytics
  render them without composite-PK surgery across backends. Server, CLI
  folder sync, peer import, and conflict lookup converge on one persisted
  origin via `AdoptOrigin`.
- uploads, imports, SSH-pulled, and orphan-preserved sessions all publish;
  it is debounced through the existing pg-watch sink loop. Import diffs
  checkpoints against `artifact_sync_state`, hash-verifies segments, and
  writes foreign sessions through the existing `UpsertSession`/message
  paths, inheriting FTS5 maintenance, tombstone rejection, and pin
  re-attachment. Undelivered segments are recorded as phantoms and retried,
  tolerating out-of-order delivery from dumb transports.
- tiny HLC-stamped change events replayed deterministically with per-field
  last-writer-wins. Recording is gated on opt-in: a machine appends ledger
  events only once it has an artifact origin (via `sync --init`, a `sync`
  run, or an incoming peer exchange); until then curation stays local and
  the `--init` baseline snapshot publishes it later. Concurrent conflicting
  edits are never silently dropped: the losing value is logged to
  `meta_conflicts` and surfaced in the UI as a fork badge. Local edits
  record their own LWW register and applied-event marker on write, so a
  later peer event with a lower order key can no longer overwrite a newer
  local edit; replay advances the local HLC past observed remote events to
  keep later local edits causally ahead; and a single not-yet-durable
  target defers only its own event rather than aborting the rest of an
  origin's replay.
- `sync` verb, three interchangeable target shapes behind a shared
  `Transport` interface (export -> set-union exchange -> import):
  - `agentsview sync [--init|--watch] <dir>`, safe for Syncthing, Dropbox,
    NFS, or rclone mounts because every file is immutable temp+rename and
    single-writer-per-prefix.
  - `agentsview sync https://peer:8080 [--token <t>]` exchanges directly
    over the embedded server's artifact API behind the existing
    Bearer-token middleware. A `GET /{origin}/index` route enumerates an
    origin's artifacts so metadata events (not referenced by the
    checkpoint) can be pulled; `--token` defaults to the local auth token
    for a fleet sharing one symmetric token.
  - `agentsview sync s3://bucket/prefix` against any S3-compatible store
    (AWS, MinIO, Backblaze B2). Requests are signed with AWS Signature
    Version 4 implemented from the standard library, so there is no AWS
    SDK dependency; credentials and addressing come from the standard
    `AWS_*` env vars plus `AGENTSVIEW_S3_*` overrides.
- from an origin's latest checkpoint) are reclaimed both on demand
  (`agentsview sync gc [--dry-run] [--grace <d>] <dir>`) and automatically
  after a folder sync, over the local store and the shared target together
  so set-union cannot re-propagate the deleted files. A grace window
  protects slow peers, origins without checkpoints are skipped (never read
  as a deletion), and `--gc-grace`/`--no-gc` tune or disable the automatic
  pass.
- `--watch` keeps any target shape syncing on change plus a periodic floor
  through the pg-watch loop, and a peers page shows each origin's published
  vs. locally-present session counts, checkpoint sequence, last-published
  time, and total conflict count.
Scope, tradeoffs, and limitations

- transports have no per-writer identity and the HTTP API uses one shared
  token, so any peer can forge any origin's metadata. This is documented as
  exactly that; per-peer tokens and origin signatures are the follow-up
  before any sharing story.
- append-mostly (a grow-only set), and metadata is a small append-only LWW
  log, so a general CRDT library would add real cost without solving a
  problem this data has.
- compressed artifacts); zstd recovers 5-10x and GC reclaims superseded
  bulk artifacts behind a grace window.
- can reach the same transport; a NAS, bucket, or always-on peer is the
  practical rendezvous by convention, not by privileged architecture.
- tests and by a MinIO integration test (`make test-minio`, run in CI) that
  validates real S3 interop end to end; it has not been run against AWS
  itself. Remote GC on an object store or HTTP peer is that peer's own
  responsibility; auto-GC after a non-folder sync only collects the local
  store.
- `owner_marker` push design from the merged #701 (fix(postgres): preserve
  source machine on pg push) / #724 (fix(postgres): guard pg push against
  same-id cross-machine row collision); the session-sink seam is extracted
  (`drainSessionBatches`) so PG is one sink and the artifact exporter can
  become another. PG read mode returns no metadata-ledger conflicts (a
  parity stub in `internal/postgres/metadata.go`).

The change is additive: upgrading generates an origin id and behaves
exactly as today when sync is not configured. New tables arrive through the
existing idempotent migration path with no dataVersion bump and no resync.
Where to review

- `internal/artifact/format_test.go`
- `internal/artifact/hlc.go`, `internal/artifact/replay.go`, `internal/db/metadata_replay.go`
- `internal/artifact/sync.go`
- `internal/artifact/transport.go`, `transport_http.go`, `transport_s3.go`
- `internal/server/huma_routes_artifacts.go`, `internal/artifact/peer.go`
- `internal/server/metadata_events.go` and the frontend `SessionBreadcrumb`/`TrashPage` components
- `internal/artifact/twoinstance_test.go`, `internal/server/artifact_http_transport_test.go`, `internal/e2e/artifact_sync_test.go`

Relates to #692.
Claude Opus 4.8 reasoning-medium on behalf of maphew