
feat(migrate): lodedb migrate toolkit for existing vector stores #36

Merged
Davidobot merged 4 commits into main from feat/migrate-toolkit on Jun 26, 2026


@Davidobot

Contributor

Summary

  • New lodedb migrate sub-app: a plan-first, non-destructive toolkit that moves an existing LangChain, LlamaIndex, or mem0 store, or a direct pgvector table, onto a local LodeDB path. The flow is inspect, plan, dry-run, run, then validate. The source store is only ever read.
  • Detection routes a framework owner (LangChain/LlamaIndex/mem0) ahead of any direct provider beneath it; direct pgvector is the first provider-first path. Ambiguous projects stop and ask for --framework/--provider.
  • Source exporters: LangChain InMemoryVectorStore, LlamaIndex SimpleVectorStore, mem0 Qdrant, and direct pgvector (read-only, keyset-paginated, with information_schema/atttypmod column and dimension discovery).
  • run writes to a temp dir, reopens it read-only, validates count/sample/text and the persisted-index audit, then renames into place. The default is a dry run; --write performs the migration. An existing target is never clobbered.
  • Plans and the migration.json manifest are payload-free; connection strings are redacted and never persisted (re-supply with --source at run time).
  • Switch snippets cover vector-preserve and text-owned SDK usage.
  • Public agent pages docs/migrate-agent.md (framework) and docs/install-agent.md (provider-first router), linked from docs/integrations.md and the README. These still need publishing to egoistmachines.com/lodedb/{migrate-agent,install-agent}.
  • audit_persisted_index_snapshots and the index loader skip migration.json so a manifest can live in a migrated store directory, mirroring collection.json.
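
The temp-dir, validate, rename sequence described for `run` can be sketched roughly as follows. This is a minimal illustration of the general pattern, not the toolkit's code: `staged_publish`, `build`, and `validate` are hypothetical names, and the real runner performs more checks than a single boolean callback.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def staged_publish(
    target: Path,
    build: Callable[[Path], None],
    validate: Callable[[Path], bool],
) -> Path:
    """Build into a sibling temp dir, validate it, then rename it onto target."""
    target = Path(target)
    if target.exists():
        # Mirrors "an existing target is never clobbered".
        raise FileExistsError(f"target already exists: {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so the final rename stays on one filesystem.
    staging = Path(
        tempfile.mkdtemp(prefix=target.name + ".", suffix=".tmp", dir=target.parent)
    )
    build(staging)
    if not validate(staging):
        # Nothing is published; the staged output stays behind for inspection.
        raise RuntimeError(f"validation failed; staged output left at {staging}")
    os.rename(staging, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        published = staged_publish(
            Path(base) / "store",
            build=lambda d: (d / "data.txt").write_text("hello"),
            validate=lambda d: (d / "data.txt").exists(),
        )
        print(sorted(p.name for p in published.iterdir()))
```

Staging in the same parent directory is the design choice that matters here: a rename within one filesystem either happens or does not, so readers never see a half-written store.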

Deferred to follow-ups: direct exporters for Qdrant (non-mem0), Chroma, LanceDB, sqlite-vec, and FAISS (detected and reported today, export pending); vector-preserve mode for LangChain/LlamaIndex; representative-query overlap during validate; live-server integration tests.

Closes #34
Closes #35

Testing

  • uv run pytest tests/test_migrate_toolkit.py tests/test_migrate_pgvector_and_cli.py -q: 34 passed
  • uv run pytest -q (full suite): 535 passed, 2 skipped
  • uv run ruff check .: clean

Add a plan-first, non-destructive `lodedb migrate` sub-app that moves an
existing LangChain, LlamaIndex, or mem0 store, or a direct pgvector table,
onto a local LodeDB path. The flow is inspect, plan, dry-run, run, then
validate; the source store is only ever read.

- Detection routes a framework owner (LangChain/LlamaIndex/mem0) ahead of any
  direct provider beneath it; direct pgvector is the first provider-first path.
  Ambiguous projects stop and ask for --framework/--provider.
- Source exporters: LangChain InMemoryVectorStore, LlamaIndex SimpleVectorStore,
  mem0 Qdrant, and direct pgvector (read-only, keyset-paginated, with
  information_schema/atttypmod column and dimension discovery).
- run writes to a temp dir, reopens it read-only, validates count/sample/text
  and the persisted-index audit, then renames into place. The default is a dry
  run; --write performs the migration. An existing target is never clobbered.
- Plans and the migration.json manifest are payload-free; connection strings
  are redacted and never persisted (re-supply with --source at run time).
- Switch snippets cover vector-preserve and text-owned SDK usage.
- Public agent pages docs/migrate-agent.md (framework) and docs/install-agent.md
  (provider-first router), linked from docs/integrations.md and the README.
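
For readers unfamiliar with keyset pagination, the read pattern named in the pgvector exporter bullet looks roughly like this. The sketch uses the standard-library `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere; the `items` table, its columns, and `iter_rows_keyset` are hypothetical, and the real exporter would also read a vector column through a Postgres driver.

```python
import sqlite3
from typing import Iterator, Tuple


def iter_rows_keyset(
    conn: sqlite3.Connection, page_size: int = 2
) -> Iterator[Tuple[int, str]]:
    """Yield (id, content) rows in id order, one bounded page at a time."""
    last_id = None
    while True:
        if last_id is None:
            cur = conn.execute(
                "SELECT id, content FROM items ORDER BY id LIMIT ?",
                (page_size,),
            )
        else:
            # Resume strictly after the last key seen instead of using OFFSET,
            # so each page is an index range scan of bounded size.
            cur = conn.execute(
                "SELECT id, content FROM items WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, page_size),
            )
        page = cur.fetchall()
        if not page:
            return
        yield from page
        last_id = page[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"doc-{i}") for i in (1, 2, 3, 5, 8)],
)
rows = list(iter_rows_keyset(conn))
print(rows)
```

Only `SELECT` statements are issued, which matches the read-only constraint on the source store.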

audit_persisted_index_snapshots and the index loader now skip migration.json so
a manifest can live in a migrated store directory, mirroring collection.json.
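
Since the manifest sits inside the store directory, the rule that connection strings are redacted and never persisted matters in practice. One way such redaction could look is sketched below with `urllib.parse`; `redact_dsn` is a hypothetical helper, and the toolkit's actual redaction format is not shown in this PR.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_dsn(dsn: str) -> str:
    """Return a URL-style DSN with credentials and query parameters removed."""
    parts = urlsplit(dsn)
    host = parts.hostname or ""
    if ":" in host:
        # urlsplit strips the brackets from IPv6 literals; restore them.
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    has_credentials = parts.username is not None or parts.password is not None
    netloc = f"***@{host}" if has_credentials else host
    # The query and fragment are dropped as well, because parameters such as
    # ?password=... can carry secrets.
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))


print(redact_dsn("postgresql://alice:s3cret@db.example.com:5432/app?sslmode=require"))
```

A redacted string like this is suitable for logs and plans only. It cannot be used to reconnect, which is why the source has to be re-supplied with `--source` at run time.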

Closes #34
Closes #35
…eopen, batch writes

Address review findings on the migration runner:

- run no longer publishes a failed migration. After validation it records the
  manifest in the temp dir, and only moves the store into the target when
  validation passed. A failed run raises and leaves any existing target
  unchanged, so a failed --overwrite-target run cannot replace a valid store.
- Validation reopens vector-preserve targets with the effective dimension (the
  source-discovered dimension when the plan did not pin --embedding-dim) rather
  than the dim-8 fallback, so direct pgvector migrations that rely on dimension
  discovery validate correctly.
- Writes are buffered into bounded batches and flushed through the batch SDK
  APIs (add_many, add_vectors_many, adapter batch inserts), so a large migration
  pays one commit and one embedding pass per batch instead of per row.
- Validation compares a bounded sample of source rows against the target (id
  presence, scalar-metadata subset, stored text after reopen) and counts via
  count() instead of materializing every document.
- migration.json is written via durable_replace so a crash mid-write cannot
  leave a partial manifest.
- The published agent pages cross-link the website URLs rather than repo files.
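
The bounded-batch buffering described above can be sketched as follows. `BatchBuffer` is a hypothetical helper, and its `flush` callback stands in for whichever batch API applies (for example `add_many`); the signatures of the real SDK calls are not assumed here.

```python
from typing import Callable, Generic, List, Sequence, TypeVar

T = TypeVar("T")


class BatchBuffer(Generic[T]):
    """Collect rows and hand them to a flush callback in bounded batches."""

    def __init__(
        self, flush: Callable[[Sequence[T]], None], batch_size: int = 256
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush = flush
        self._batch_size = batch_size
        self._rows: List[T] = []

    def add(self, row: T) -> None:
        self._rows.append(row)
        if len(self._rows) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._rows:
            # Swap the buffer out first so a failing callback cannot cause
            # the same rows to be flushed twice.
            batch, self._rows = self._rows, []
            self._flush(batch)

    def __enter__(self) -> "BatchBuffer[T]":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Flush the final partial batch only on a clean exit.
        if exc_type is None:
            self.flush()


calls: List[Sequence[int]] = []
with BatchBuffer(calls.append, batch_size=3) as buf:
    for i in range(7):
        buf.add(i)
print(calls)
```

With seven rows and a batch size of three, the callback runs three times, so the per-commit and per-embedding-pass cost scales with the number of batches rather than the number of rows.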

Tests cover failed-validation-does-not-publish, failed-overwrite-leaves-target,
discovered-dimension validation, and batched write behavior.
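
The crash-safety property claimed for the manifest write is usually achieved with a write-to-temp, fsync, atomic-replace sequence. The sketch below shows that common pattern under the assumption that `durable_replace` works along these lines; `durable_write_json` is a hypothetical name and is not the project's helper.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Mapping


def durable_write_json(path: Path, payload: Mapping[str, Any]) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(
        prefix=path.name + ".", suffix=".tmp", dir=path.parent
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())  # push file contents to disk before renaming
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the replace.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    if hasattr(os, "O_DIRECTORY"):
        # On POSIX, also sync the directory entry so the rename itself persists.
        dir_fd = os.open(path.parent, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        manifest = Path(d) / "migration.json"
        durable_write_json(manifest, {"status": "migrated"})
        print(manifest.read_text())
```
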
…d overlap threshold

Follow-up review nits:

- The unpublished temp manifest of a failed run now reports status "failed". The
  run status is set before the inspection manifest is written, so
  target.tmp/migration.json no longer says "migrated" when validation failed.
- Drop the query-overlap / query-sample thresholds from the plan and the plan
  Markdown, and stop advertising a representative query-overlap check in the
  migrate docs. Representative-query overlap is not enforced yet, so the plan and
  docs only state the checks the runner actually runs (count parity, the
  id/metadata/text sample, stored-text recovery, persisted-index audit).
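
As an illustration of the id/metadata/text sample check listed above, a bounded comparison might look like the following. Everything here is an assumption made for the sketch: `validate_sample` is a hypothetical function, the row shape (`id`, `metadata`, `text`) is invented, the target is modelled as a plain mapping, and the sample is simply the first `sample_size` source rows.

```python
from itertools import islice
from typing import Any, Iterable, List, Mapping

SCALARS = (str, int, float, bool, type(None))


def validate_sample(
    source_rows: Iterable[Mapping[str, Any]],
    target: Mapping[str, Mapping[str, Any]],
    sample_size: int = 100,
) -> List[str]:
    """Compare a bounded sample of source rows against the reopened target."""
    problems: List[str] = []
    for row in islice(source_rows, sample_size):
        doc_id = row["id"]
        stored = target.get(doc_id)
        if stored is None:
            problems.append(f"{doc_id}: missing in target")
            continue
        stored_meta = stored.get("metadata", {})
        for key, value in row.get("metadata", {}).items():
            # Only scalar metadata is compared; the target may hold extra keys.
            if isinstance(value, SCALARS) and stored_meta.get(key) != value:
                problems.append(f"{doc_id}: metadata[{key!r}] differs")
        if row.get("text") is not None and stored.get("text") != row["text"]:
            problems.append(f"{doc_id}: stored text differs")
    return problems


source = [
    {"id": "a", "metadata": {"lang": "en", "tags": ["x"]}, "text": "alpha"},
    {"id": "b", "metadata": {"lang": "de"}, "text": "beta"},
    {"id": "c", "metadata": {}, "text": "gamma"},
]
target = {
    "a": {"metadata": {"lang": "en", "extra": 1}, "text": "alpha"},
    "b": {"metadata": {"lang": "fr"}, "text": "beta"},
}
problems = validate_sample(source, target)
print(problems)
```
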
…son tables

Add a short call-out beneath the comparison tables pointing a project's coding
assistant at the public install-agent page to migrate an existing store onto LodeDB.
@Davidobot Davidobot merged commit 2406a50 into main Jun 26, 2026
4 checks passed
@Davidobot Davidobot deleted the feat/migrate-toolkit branch June 26, 2026 00:58

Development

Successfully merging this pull request may close these issues.

Public agent install flow for migrating local projects from existing vector providers
Migration toolkit for LangChain, LlamaIndex, and mem0 stores
