Multimodal embedding: CLIP image+text preset, custom embedders, named vector spaces #26

Merged
Davidobot merged 33 commits into main from feat/multimodal-embedding on Jun 25, 2026
Davidobot (Contributor) commented Jun 24, 2026

Summary

  • Add a clip preset: a sentence-transformers CLIP backend embeds images and text into one shared space, so db.add_image() / db.add_images() and cross-modal text-to-image / image-to-image search run over the existing single-vector TurboVec scan with no storage or scoring change. CLIP runs on the sentence-transformers stack (separate from the ONNX-default text runtime), and the [image] extra adds only Pillow; both sentence-transformers and Pillow are lazy-imported.
  • db.add_images([...]) embeds a gallery in backend-sized batches and commits once.
  • Add a public embedder= argument so any EngineEmbeddingBackend can drive a text-capable index at its own dimension (it must declare a non-secret required_model_name).
  • Add LodeCollection: named vector spaces (sibling indexes) under one root, each recorded with an explicit kind and reopened from a manifest.
  • Promote bring-your-own-vectors: README section + feature bullet + [image] extra, docs/multimodal.md, examples/multimodal_clip.py.
  • Add benchmarks/multimodal_image/; doctor reports image-embedding readiness.

The raw image bytes never reach LodeDB (keep the path in metadata); the on-disk format is unchanged.

Correctness and privacy (addresses all review rounds)

Rebased on current main (includes #32's runtime-default hardening; the WAL-default constructor docs now read correctly).

  • store_text=False keeps raw text off disk, including the WAL. Vector/image WAL records drop the caption. Text-in records log the chunk embedding delta (and, with index_text=True, the per-document lexical tokens) instead of the body, and apply_embedded_documents replay rebuilds rows on the committed base. As a result, store_text=False text indexes keep full WAL durability, and mode="lexical"/"hybrid" recover after a crash, all with no raw text written.
  • Vector/image upserts refresh lexical postings. Replacing a text document with add_vectors/add_image at the same id used to leave the old body's terms in the lexical index. The upsert now sets the document's tokens from the text= caption (searchable when index_text=True) or to an empty list when there is no caption, clearing the stale postings; the change is journaled to .tvlex and the caption tokens are logged in the WAL, so it holds in the live handle and after a crash-replay.
  • Full route identity enforced at open. Loading validates the persisted (model, provider, task, native_dim, storage profile, bit width) against the route policy, so a vector-only store cannot reopen as a custom-embedder route (same model id, different task) and a bit-width change is not silently ignored. A public embedder= must declare a non-empty required_model_name.
  • Honest bit_width. A preset's width is fixed by its route, so an explicit conflicting bit_width for a preset is now rejected rather than silently ignored; bit_width is configurable only on vector_dim= / embedder= indexes, where it must be 2 or 4. LodeCollection records a preset space's effective width, never a caller value that would not take effect.
  • LodeCollection is a crash-safe registry. It records each space's kind (preset / vector / custom) and its privacy/indexing flags (store_text/index_text), and re-applies them on reopen, so a store_text=False space never silently flips back to retaining raw text. The registry meets the engine's durability bar: the manifest honors durability="fsync"/LODEDB_DURABILITY (spaces inherit the collection's durability), and a failed manifest publish rolls the space back (closing it, releasing its lock, restoring the registry) instead of leaving an open, unregistered, locked space. col.space("name") reopens preset/vector spaces from the manifest with no args; a custom-embedder space is reopened with a matching embedder=. space() enforces config before returning a cached handle; manifest writes take a lock + read-merge-write. The legacy snapshot loader skips collection.json, so opening a collection root as a plain index no longer fails with a schema-version error.
  • Image ingestion is guarded and observable. add_image/add_images validate the payload (metadata/text) before the encode, so a bad request wastes no CLIP call and skews no metrics; reject an image whose pixel count exceeds LODEDB_MAX_IMAGE_PIXELS (default ~64 MP) before the full decode (a decompression-bomb guard); open images under a context manager; and bound peak decoded-image memory to the batch. stats()["image_embedding"] reports per-handle encode count, time, and failures split by phase (ingest vs query), so search_by_image encode cost is visible too (no paths or captions).
  • Build/CI: uv.lock tracks [image]; CI runs uv lock --check, syncs --extra image, and hardens cargo crates.io fetches against the Windows schannel flake. The WAL-privacy file scan skips the .lodedb.lock sentinel, which the live writer holds byte-locked on Windows, so the scan also runs there.
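The bit_width rule described in the list above can be sketched as a small resolver. This is an illustration only, not LodeDB's code: `resolve_bit_width` and the `PRESET_BIT_WIDTH` table are hypothetical names, and the default of 4 for non-preset indexes is an assumption (the PR states only that the clip preset is 4-bit and that 2 and 4 are the valid widths).

```python
from typing import Optional

VALID_BIT_WIDTHS = (2, 4)  # per the PR, only 2 and 4 are valid

# Hypothetical preset table; the PR states the 'clip' preset is 4-bit.
PRESET_BIT_WIDTH = {"clip": 4}


def resolve_bit_width(preset: Optional[str], requested: Optional[int]) -> int:
    """Return the effective bit width, or raise ValueError on a conflict."""
    if preset is not None:
        fixed = PRESET_BIT_WIDTH[preset]
        # A preset fixes the width: reject a conflicting explicit value
        # instead of silently ignoring it.
        if requested is not None and requested != fixed:
            raise ValueError(
                f"preset {preset!r} fixes bit_width={fixed}; got {requested}"
            )
        return fixed
    # vector_dim= / embedder= index: the caller may choose the width.
    width = 4 if requested is None else requested  # default of 4 is an assumption
    if width not in VALID_BIT_WIDTHS:
        raise ValueError(f"bit_width must be one of {VALID_BIT_WIDTHS}; got {width}")
    return width
```

Because the resolver returns the preset's own width, a registry that records the returned value never stores a caller value that would not take effect.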

Answers to the review questions

  • Should <key>.wal be classified as a payload-bearing store artifact? Yes. It is now documented as payload-bearing between checkpoints in the architecture payload-boundary section and the README: raw text under store_text=True, otherwise embedding deltas plus (with index_text=True) lexical tokens; persist()/close() checkpoint and truncate it; read-only handles never read it; commit_mode="generation" keeps no WAL. Operators should treat it as the same data class as the .tvtext/.tvlex sidecars.
  • Is add_images atomic-batch-only for 0.3.x? Yes for now. It commits the whole call atomically; the docstring and docs now recommend a chunked loop (one add_images call per chunk) for large galleries, with bounded memory and natural resume points. A first-class chunked/resumable helper is a tracked follow-up, not in this PR.
  • Should LodeCollection be a crash-safe source of truth? Yes, and it now is: the manifest commits crash-atomically (tmp + os.replace), honors fsync durability, and a failed publish rolls back rather than orphaning a locked space.
  • Should it inherit LODEDB_DURABILITY? Yes. LodeCollection resolves durability exactly like LodeDB (explicit durability= arg, else LODEDB_DURABILITY, else fast) for the manifest, and spaces inherit it by default.
  • Is add_image expected behind untrusted input? It is now hardened for it: payload validation runs before the encode, and oversized/decompression-bomb images are rejected from the header before decode.
  • Do vector/image text= captions participate in lexical/hybrid when index_text=True? Yes. A caption's tokens are indexed on upsert (and logged in the WAL), so it is found by mode="lexical"/"hybrid"; a captionless vector/image clears the id's postings instead.
  • Does LodeCollection own the privacy/indexing flags, or must callers restate them? It owns them: store_text/index_text are recorded in the manifest and re-applied on reopen, with a conflicting explicit override rejected.
  • LodeCollection + custom embedders: supported. A custom space records {kind: custom, model_identity, bit_width, store_text, index_text} and is reopened with a matching embedder=; the identity is re-enforced when the underlying index opens.
  • Public bit_width: only 2 and 4 are valid (TurboVec). It is fixed by a preset and configurable only on vector_dim= / embedder= indexes, enforced at the SDK boundary and on reopen.
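The chunked loop recommended for large galleries (one add_images call per chunk) might look like the sketch below. `add_images_chunked`, `chunks`, and the `RecordingDB` stand-in are hypothetical helpers written for this illustration; only the `add_images` method name and the `{image, id?, metadata?, text?}` item shape come from the PR.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def chunks(items: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def add_images_chunked(db: Any, items: Iterable[Dict[str, Any]], chunk_size: int = 256) -> int:
    """Call db.add_images once per chunk; each call is one atomic commit."""
    total = 0
    for batch in chunks(items, chunk_size):
        db.add_images(batch)  # a crash loses at most the chunk in flight
        total += len(batch)
    return total


class RecordingDB:
    """Stand-in for a LodeDB handle; it only records the size of each commit."""

    def __init__(self) -> None:
        self.commits: List[int] = []

    def add_images(self, items: List[Dict[str, Any]]) -> None:
        self.commits.append(len(items))
```

Each completed chunk is a natural resume point: a caller that tracks how many items were committed can restart from that offset.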

Out of scope / follow-ups

  • Late-interaction / visual-document retrieval (ColPali/ColQwen): tracked in #25, "Late-interaction (multi-vector / MaxSim) retrieval for visual-document RAG".
  • A first-class chunked/resumable add_images helper (commit_every, progress, resume) for unbounded galleries; today it is one atomic commit per call, with a documented chunked loop.
  • Fleet-wide image-encode observability (cross-handle/cross-process aggregation, structured encode events); today stats()["image_embedding"] is per-handle.
  • A redacted metric/log for collection manifest-publish failures (today a failure raises a clear error and rolls back).
  • CLI image / collection verbs, if multimodal should be drivable from the lodedb CLI.

Testing

  • uv run ruff check ., uv lock --check
  • uv run pytest -q (501 passed, 2 skipped)
  • New round-6 tests: a fault-injected manifest-publish failure rolls the space back and leaves no leaked lock (a fresh collection reopens it); the manifest write uses fsync under durability="fsync"; add_image with invalid metadata does zero encodes.
  • New round-5 tests: collection persists/enforces store_text/index_text across reopen (the privacy repro); text->vector/image replacement clears stale lexical postings (same handle + reopen); a vector caption is lexically searchable and recovers from an uncheckpointed-WAL crash; image-embedding metrics split into ingest/query.
  • Earlier rounds: store_text=False text-in crash-replay (vector + lexical recovery, zero raw text); full-identity reopen rejection (vector-only/custom task collision, bit-width mismatch); preset rejects a conflicting bit_width; public embedder= identity; collection custom-space record/reopen and plain-index-open-on-collection-root; oversized-image rejection; Pillow import-boundary laziness.
  • Smoke-tested the real CLIP path; ran the multimodal benchmark locally (no regression).

Davidobot force-pushed the feat/multimodal-embedding branch from 6cd2f91 to 83d24a6 on June 25, 2026 03:49
Davidobot added 29 commits June 25, 2026 08:41
Promote the internal _embedding_backend test hook to a supported embedder=
parameter. A caller-supplied EngineEmbeddingBackend drives a text-capable index
at its own native_dim, with its required_model_name pinned into the snapshot
header and re-enforced on reopen. Mutually exclusive with vector_dim.
…earch_by_image)

Add a multimodal preset backed by a sentence-transformers CLIP model that embeds
text and images into one shared space, so text->image and image->image search run
over the existing single-vector TurboVec scan with no storage or scoring change.

- ClipEmbeddingBackend (engine): lazy sentence-transformers + Pillow, so a plain
  import lodedb pulls neither; guarded by a new import-boundary test.
- clip-turbovec route profile + 'clip' preset (512-dim, 4-bit), wired through
  build_local_embedding_backend.
- db.add_image / db.search_by_image: embed an image (path/bytes/PIL) and reuse the
  vector-in path; raw bytes are never stored (keep the path in metadata).
- [image] extra adds only Pillow (CLIP rides the base sentence-transformers stack).
- doctor reports image-embedding readiness.
Group several independent LodeDB indexes (spaces) under one root directory, each
free to use a different model or dimension (e.g. a text space at model='minilm'
beside an image space at model='clip'). A collection.json manifest records each
space's (model, vector_dim, bit_width) and re-enforces it on reopen. Spaces are
searched independently; there is no cross-space scoring, since vectors from
different models are not comparable. The engine is unchanged.
Add docs/multimodal.md and examples/multimodal_clip.py, a README section, and the
clip preset to the quickstart preset list.
Feeds every store the same precomputed CLIP-dimension vectors and reports ingest,
on-disk footprint, query latency, and recall@k against the exact brute-force top-k.
Competitors are optional (guarded imports).
The vendored TurboVec build fetches crates.io deps; the Windows runner
intermittently fails with '[56] schannel: server closed abruptly (missing
close_notify)'. Disable HTTP/2 multiplexing (the trigger) and raise the network
retry count so a transient drop no longer fails the build.
… index identity on reopen

Critical: under store_text=False the WAL serialized raw text (caption or document
body), violating the no-raw-text-on-disk contract (WAL is the default commit mode).

- Vector-in/image WAL payload drops text when store_text is off; replay rebuilds
  the row from the vector. Vector-only indexes keep the WAL with no leak.
- Text-in replay re-embeds from the body, so it needs the text: a text-embedding
  index with store_text=False now resolves durability to generation (which persists
  compact codes, no raw text); an explicit commit_mode='wal' there is rejected. An
  engine-level guard backs this for direct callers.

High: loading an index only checked dimension, so a same-dimension different-model
backend reopened and served meaningless scores. _validate_loaded_state_identity now
enforces persisted (model, native_dim) against the route policy and backend at open.
… and merge manifest writes

- space() now enforces the requested (model, vector_dim, bit_width) before returning
  an already-open handle, so a mismatched in-process reopen fails immediately.
- manifest writes take a collection-root advisory lock and read-merge-write, so a
  space another handle created since load is not lost to last-writer-wins.
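A lock plus read-merge-write cycle like the one described above can be sketched with the standard library. This is an assumption-laden illustration: the lock here is an O_EXCL lock file named `manifest.lock` (a hypothetical name and a stand-in for whatever advisory lock the PR uses), and `register_space` is not LodeCollection's method.

```python
import json
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def root_lock(root: Path, timeout: float = 5.0):
    """Exclusive lock via O_EXCL file creation (stand-in for an advisory lock)."""
    lock_path = root / "manifest.lock"  # hypothetical file name
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock {lock_path}")
            time.sleep(0.01)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def register_space(root: Path, name: str, config: dict) -> dict:
    """Add one space to collection.json without dropping concurrent additions."""
    manifest_path = root / "collection.json"
    with root_lock(root):
        # Re-read under the lock so spaces created by other handles survive.
        try:
            spaces = json.loads(manifest_path.read_text(encoding="utf-8"))
        except FileNotFoundError:
            spaces = {}
        spaces[name] = config
        tmp = manifest_path.with_name(manifest_path.name + ".tmp")
        tmp.write_text(json.dumps(spaces, sort_keys=True), encoding="utf-8")
        os.replace(tmp, manifest_path)  # swap the merged manifest into place
        return spaces
```

The merge step is what prevents last-writer-wins: a handle that loaded the manifest before another space was registered still writes back a manifest that contains it.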
…stalls image extra

- uv.lock now records the [image] extra (uv lock --check was failing).
- CI asserts uv lock --check and syncs --extra image so the optional extra is
  validated across the OS matrix.
…r identity is non-secret

README quickstart and architecture.md still listed only minilm/bge. Also document
that a custom embedder's required_model_name is persisted in the index header
(re-enforced on reopen) and must be a non-secret public identifier.
add_images embeds a whole batch of images in a single embed_images call and stores
them in one atomic commit, instead of one encode + commit per add_image. Each item
is {image, id?, metadata?, text?}; the per-image storage contract matches add_image
(raw bytes never stored). Updates the example and docs to use it.
Add multimodal to the tagline and a Multimodal feature bullet linking to the
existing section.
A text-embedding index with store_text=False now keeps WAL durability instead of
falling back to generation. When raw-text storage is off, the WAL logs the ingest's
chunk-level delta (added chunks with their embeddings + removed chunk ids + each
doc's final chunk list/hash/metadata) rather than the raw body, and a new
apply_embedded_documents replay rebuilds the rows on top of the committed base, so
the WAL carries no raw text and replay never re-embeds. Removes the generation
fallback and the reject guard. Adds a crash-replay regression test (uncheckpointed
WAL -> reopen -> identical ranking, zero raw text on disk).
…ndexes

Under index_text=True + store_text=False, apply_embedded_documents replay rebuilt
chunks/vectors but not state.document_tokens, so lexical/hybrid search silently
missed recovered docs after a crash. Log per-doc token lists in the embedded WAL
payload (payload-derived terms the lexical index already persists, not raw text)
and restore them on replay. Adds an uncheckpointed-WAL crash test for
index_text=True, store_text=False + lexical search.
…images memory

- A public embedder= must declare a non-empty required_model_name (pinned in the
  header, re-enforced on reopen), so a same-dimension different-model backend is
  rejected rather than silently scored. Identity-free fixtures stay on the internal
  _embedding_backend hook.
- add_images decodes/encodes in backend-sized batches instead of loading the whole
  gallery at once, bounding peak memory; still one atomic commit.
space(name) compared fresh defaults (minilm/None/4) against the manifest, so a
recorded vector-only or clip space failed to reopen with no config args. Use
sentinel defaults: the recorded config is the reopen default, and only an explicit
conflicting override raises.
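The sentinel-default approach can be illustrated as follows. The names `_UNSET` and `resolve_space_config` are hypothetical; the sketch only shows the general pattern of telling "argument not passed" apart from "argument passed with a value that matches the old default".

```python
_UNSET = object()  # sentinel: the caller did not pass this argument


def resolve_space_config(recorded, model=_UNSET, vector_dim=_UNSET, bit_width=_UNSET):
    """Use the recorded config as the default; reject explicit conflicts."""
    requested = {"model": model, "vector_dim": vector_dim, "bit_width": bit_width}
    resolved = dict(recorded)
    for key, value in requested.items():
        if value is _UNSET:
            continue  # not supplied: keep whatever the manifest recorded
        if key in recorded and recorded[key] != value:
            raise ValueError(
                f"{key}={value!r} conflicts with recorded {key}={recorded[key]!r}"
            )
        resolved[key] = value
    return resolved
```

With a plain default such as model="minilm", a no-argument reopen of a clip space would look like an explicit request for minilm; the sentinel removes that ambiguity.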
…idate bit_width

_validate_loaded_state_identity now checks model, provider, task, native_dim, storage
profile, and TurboVec bit width against the route policy, not just model+dim. This
rejects a vector-only store reopened as a custom-embedder route (same model id, but
task differs) and a reopen at a different bit width. Public bit_width is validated to
{2, 4} at construction. Also narrows the add_images memory claim to decoded-image
batches (vectors still accumulate for one commit).
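A field-by-field identity check of this kind might be sketched as below. `RouteIdentity` and `validate_identity` are hypothetical names, not the PR's `_validate_loaded_state_identity`; only the six compared fields are taken from the commit message.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RouteIdentity:
    """The six fields the commit message says are compared at open."""

    model: str
    provider: str
    task: str
    native_dim: int
    storage_profile: str
    bit_width: int


def validate_identity(persisted: RouteIdentity, expected: RouteIdentity) -> None:
    """Raise ValueError naming every field that differs."""
    mismatches = [
        f"{f.name}: persisted={getattr(persisted, f.name)!r}, "
        f"expected={getattr(expected, f.name)!r}"
        for f in fields(RouteIdentity)
        if getattr(persisted, f.name) != getattr(expected, f.name)
    ]
    if mismatches:
        raise ValueError("index identity mismatch: " + "; ".join(mismatches))
```

Comparing the full tuple is what catches the case described above, where the model id matches but the task differs.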
… spaces

space() now records an explicit kind (preset/vector/custom) instead of a false
'minilm' for custom spaces. A custom space records its embedder identity and must be
reopened with a matching embedder= (clear error otherwise; the identity is re-enforced
when the index opens). Preset/vector spaces still reopen from the manifest with no args.
…esh architecture surface

ClipEmbeddingBackend._load_image now opens images under 'with' so file/stream handles
close promptly. architecture.md drops the stale 'four sidecars' line and lists the
[image] extra / CLIP / LodeCollection / .tvlex in the storage and dependency sections.
…dows CI)

_files_containing read .lodedb.lock, which the live writer holds byte-locked on
Windows, raising PermissionError. Skip the lock sentinel by name (it carries no
payload) and tolerate PermissionError; the WAL and data sidecars are opened/closed
per write so they stay in scope.
…e guard + metrics

- bit_width is now fixed by a preset (reject a conflicting explicit value) and only
  configurable on vector_dim= / embedder= indexes; LodeCollection records a preset
  space's effective width, never a caller value that would not take effect.
- The legacy snapshot loader skips collection.json, so opening a collection root as a
  plain index no longer fails with a schema-version error.
- ClipEmbeddingBackend guards against oversized / decompression-bomb images
  (LODEDB_MAX_IMAGE_PIXELS, ~64 MP default), checking the header before full decode.
- add_images validates every item before embedding, so a bad item never wastes a
  CLIP batch; stats() reports per-handle image-embedding metrics (count, time,
  failures).
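The idea of rejecting an oversized image from its header, before any pixel data is decoded, can be shown with a stdlib-only PNG example. This is a stand-in, not the backend's code (which presumably relies on Pillow): the helper names are hypothetical, only PNG is handled, and the 64,000,000 default is an assumption based on the "~64 MP" figure.

```python
import os
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
DEFAULT_MAX_PIXELS = 64_000_000  # assumption: the PR says only "~64 MP"


def max_image_pixels() -> int:
    """Read the limit from LODEDB_MAX_IMAGE_PIXELS, else use the default."""
    return int(os.environ.get("LODEDB_MAX_IMAGE_PIXELS", DEFAULT_MAX_PIXELS))


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Read width and height from the IHDR chunk without decoding pixels."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    width, height = struct.unpack(">II", data[16:24])  # big-endian uint32 pair
    return width, height


def check_pixel_budget(data: bytes, limit: Optional[int] = None) -> int:
    """Return the pixel count, or raise ValueError if it exceeds the budget."""
    width, height = png_dimensions(data)
    pixels = width * height
    budget = max_image_pixels() if limit is None else limit
    if pixels > budget:
        raise ValueError(f"image has {pixels} pixels; limit is {budget}")
    return pixels
```

Only the first 24 bytes are inspected, so a tiny file that declares enormous dimensions is rejected without allocating a decode buffer.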
A space's privacy/indexing flags are recorded in collection.json and re-applied on
reopen, so col.space("name") restores the exact configuration it was created with;
a store_text=False space never silently flips back to retaining raw text. An explicit
flag that conflicts with the recorded one is rejected.
Replacing a text document with add_vectors/add_image(id=same) left stale tokens in
state.document_tokens, so mode='lexical'/'hybrid' kept matching the old body. A vector
document now sets its tokens from the text= caption (searchable when index_text=True,
per the SDK contract) or to an empty list when there is no caption, which clears the
stale postings and journals the clear into .tvlex. The vector WAL payload logs the
caption tokens (payload-derived, not raw text) so a store_text=False crash-replay
recovers them too.
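The stale-postings fix can be illustrated with a toy token map. This is a simplified sketch: `tokenize` is a hypothetical whitespace tokenizer, the dict stands in for state.document_tokens, and the index_text flag, .tvlex journaling, and WAL logging are left out.

```python
from typing import Dict, List, Optional


def tokenize(text: str) -> List[str]:
    """Hypothetical whitespace tokenizer; the real one is not described in the PR."""
    return text.lower().split()


def upsert_vector_document(
    document_tokens: Dict[str, List[str]], doc_id: str, caption: Optional[str] = None
) -> None:
    """Replace the id's tokens with the caption's tokens, or clear them."""
    # Always overwrite: leaving the old entry in place is the bug being fixed.
    document_tokens[doc_id] = tokenize(caption) if caption else []


def lexical_match(document_tokens: Dict[str, List[str]], term: str) -> List[str]:
    """Return the ids whose token list contains the term."""
    return sorted(doc_id for doc_id, toks in document_tokens.items() if term in toks)
```

Writing an empty list rather than skipping the update is the key step: it overwrites the old body's terms even when the replacement has no caption.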
…earch_by_image

search_by_image now goes through the tracked-encode helper, and stats()['image_embedding']
splits counters into 'ingest' (add_image/add_images) and 'query' (search_by_image), so
query-time CLIP encode latency and failures are observable, not just ingest.
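A tracked-encode helper that splits counters by phase could be sketched like this. `ImageEmbeddingStats` and its counter names are hypothetical; the PR states only that the stats report encode count, time, and failures split into ingest and query.

```python
import time
from contextlib import contextmanager


class ImageEmbeddingStats:
    """Per-handle encode counters, split by phase."""

    def __init__(self) -> None:
        self._phases = {
            phase: {"count": 0, "seconds": 0.0, "failures": 0}
            for phase in ("ingest", "query")
        }

    @contextmanager
    def track(self, phase: str, n_images: int = 1):
        """Time one encode call and record success or failure for the phase."""
        counters = self._phases[phase]
        start = time.perf_counter()
        try:
            yield
        except Exception:
            counters["failures"] += 1
            raise
        else:
            counters["count"] += n_images
        finally:
            counters["seconds"] += time.perf_counter() - start

    def snapshot(self) -> dict:
        """Return a copy holding only numbers (no paths or captions)."""
        return {phase: dict(c) for phase, c in self._phases.items()}
```

An ingest path would wrap its encode in `with stats.track("ingest", len(batch)):` and a search path in `with stats.track("query"):`, so both costs appear under separate keys.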
Davidobot force-pushed the feat/multimodal-embedding branch from 05f30b9 to 8b337c1 on June 25, 2026 15:59
…sync)

A failed collection.json publish now rolls the space back (closes the child,
releasing its writer lock, and restores the registry) instead of leaving an open,
unregistered, locked space. The manifest honors durability='fsync' (and LODEDB_DURABILITY)
so a durable space is never left invisible after a power loss; spaces inherit the
collection's durability by default.
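The publish-then-roll-back behaviour can be sketched in two small functions. This is a minimal illustration under assumptions: `publish_manifest` and `create_space` are hypothetical names, the handle is anything with a close() method, and the directory fsync runs only where os.O_DIRECTORY exists (POSIX).

```python
import json
import os
from pathlib import Path
from typing import Any, Callable, Dict


def publish_manifest(path: Path, spaces: Dict[str, dict], fsync: bool = False) -> None:
    """Write the manifest to a temp file, then swap it in with os.replace."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(spaces, fh, sort_keys=True)
        fh.flush()
        if fsync:
            os.fsync(fh.fileno())  # force file contents to stable storage
    os.replace(tmp, path)  # atomic rename on the same filesystem
    if fsync and hasattr(os, "O_DIRECTORY"):
        # POSIX only: sync the directory so the rename itself is durable.
        dir_fd = os.open(path.parent, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


def create_space(
    open_spaces: Dict[str, Any],
    manifest: Dict[str, dict],
    name: str,
    config: dict,
    handle: Any,
    publish: Callable[[Dict[str, dict]], None],
) -> None:
    """Register a space; undo everything if the manifest publish fails."""
    previous = dict(manifest)
    manifest[name] = config
    open_spaces[name] = handle
    try:
        publish(manifest)
    except Exception:
        open_spaces.pop(name, None)  # no open-but-unregistered space
        manifest.clear()
        manifest.update(previous)  # restore the registry
        handle.close()  # closing the child releases its writer lock
        raise
```

Passing the publish step in as a callable makes the failure path easy to exercise with an injected fault, in the spirit of the round-6 test described in the PR.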
Coerce metadata/text before the CLIP encode (mirroring add_images), so a bad request
wastes no encode and does not increment ingest metrics.
…stion + metrics scope

Document <key>.wal in the payload-boundary section (architecture.md + README): what it
holds by store_text/index_text mode, that persist()/close() checkpoint+truncate it,
that read-only handles never read it, and that generation mode keeps no WAL payload, so
operators classify it for backup/support/incident handling. Document add_images as an
atomic batch (not streaming) with a recommended chunked loop for large galleries, and
note that stats() image-embedding counters are per-handle and reset on reopen.
Davidobot merged commit d4c6c2d into main on Jun 25, 2026
4 checks passed
Davidobot deleted the feat/multimodal-embedding branch on June 25, 2026 16:45

1 participant