Multimodal embedding: CLIP image+text preset, custom embedders, named vector spaces by Davidobot · Pull Request #26 · Egoist-Machines/LodeDB

Davidobot · 2026-06-24T22:46:58Z

Summary

Add a clip preset: a sentence-transformers CLIP backend embeds images and text into one shared space, so db.add_image() / db.add_images() and text-to-image / image-to-image search run over the existing single-vector TurboVec scan with no storage or scoring change. CLIP rides the sentence-transformers stack (separate from the ONNX-default text runtime); the [image] extra adds only Pillow, and both sentence-transformers and Pillow are lazy-imported.
db.add_images([...]) embeds a gallery in backend-sized batches and commits once.
Add a public embedder= argument so any EngineEmbeddingBackend can drive a text-capable index at its own dimension (it must declare a non-secret required_model_name).
Add LodeCollection: named vector spaces (sibling indexes) under one root, each recorded with an explicit kind and reopened from a manifest.
Promote bring-your-own-vectors: README section + feature bullet + [image] extra, docs/multimodal.md, examples/multimodal_clip.py.
Add benchmarks/multimodal_image/; doctor reports image-embedding readiness.

Raw image bytes are never stored by LodeDB (keep the path in metadata); the on-disk format is unchanged.

Correctness and privacy (addresses all review rounds)

Rebased on current main (includes #32's runtime-default hardening; the WAL-default constructor docs now read correctly).

store_text=False keeps raw text off disk, including the WAL. Vector/image WAL records drop the caption. Text-in records log the chunk embedding delta (and, with index_text=True, the per-chunk lexical tokens) instead of the body, and apply_embedded_documents replay rebuilds rows on the committed base. As a result, store_text=False text indexes keep full WAL durability, and mode="lexical"/"hybrid" search recovers after a crash, all with no raw text written.
Vector/image upserts refresh lexical postings. Replacing a text document with add_vectors/add_image at the same id used to leave the old body's terms in the lexical index. The upsert now sets the document's tokens from the text= caption (searchable when index_text=True) or to an empty list when there is no caption, clearing the stale postings; the change is journaled to .tvlex and the caption tokens are logged in the WAL, so it holds in the live handle and after a crash-replay.
Full route identity enforced at open. Loading validates the persisted (model, provider, task, native_dim, storage profile, bit width) against the route policy, so a vector-only store cannot reopen as a custom-embedder route (same model id, different task) and a bit-width change is not silently ignored. A public embedder= must declare a non-empty required_model_name.
Honest bit_width. A preset's width is fixed by its route, so an explicit conflicting bit_width for a preset is now rejected rather than silently ignored; bit_width is configurable only on vector_dim= / embedder= indexes, where it must be 2 or 4. LodeCollection records a preset space's effective width, never a caller value that would not take effect.
LodeCollection is a crash-safe registry. It records each space's kind (preset / vector / custom) and its privacy/indexing flags (store_text/index_text), and re-applies them on reopen, so a store_text=False space never silently flips back to retaining raw text. The registry meets the engine's durability bar: the manifest honors durability="fsync"/LODEDB_DURABILITY (spaces inherit the collection's durability), and a failed manifest publish rolls the space back (closing it, releasing its lock, restoring the registry) instead of leaving an open, unregistered, locked space. col.space("name") reopens preset/vector spaces from the manifest with no args; a custom-embedder space is reopened with a matching embedder=. space() enforces config before returning a cached handle; manifest writes take a lock + read-merge-write. The legacy snapshot loader skips collection.json, so opening a collection root as a plain index no longer fails with a schema-version error.
Image ingestion is guarded and observable. add_image/add_images validate the payload (metadata/text) before the encode, so a bad request wastes no CLIP call and skews no metrics. They reject an image whose pixel count exceeds LODEDB_MAX_IMAGE_PIXELS (default ~64 MP) before the full decode (a decompression-bomb guard), open images under a context manager, and bound peak decoded-image memory to the batch. stats()["image_embedding"] reports per-handle encode count, time, and failures split by phase (ingest vs query), so search_by_image encode cost is visible too; the metrics carry no paths or captions.
Build/CI: uv.lock tracks [image]; CI runs uv lock --check, syncs --extra image, and hardens cargo crates.io fetches against the Windows schannel flake. The WAL-privacy file scan skips the writer's locked lock sentinel so it runs on Windows.

Answers to the review questions

Should <key>.wal be classified as a payload-bearing store artifact? Yes. It is now documented as payload-bearing between checkpoints in the architecture payload-boundary section and the README: raw text under store_text=True, otherwise embedding deltas plus (with index_text=True) lexical tokens; persist()/close() checkpoint and truncate it; read-only handles never read it; commit_mode="generation" keeps no WAL. Operators should treat it as the same data class as the .tvtext/.tvlex sidecars.
Is add_images atomic-batch-only for 0.3.x? Yes for now. It commits the whole call atomically; the docstring and docs now recommend a chunked loop (one add_images call per chunk) for large galleries, with bounded memory and natural resume points. A first-class chunked/resumable helper is a tracked follow-up, not in this PR.
Should LodeCollection be a crash-safe source of truth? Yes, and it now is: the manifest commits crash-atomically (tmp + os.replace), honors fsync durability, and a failed publish rolls back rather than orphaning a locked space.
Should it inherit LODEDB_DURABILITY? Yes. LodeCollection resolves durability exactly like LodeDB (explicit durability= arg, else LODEDB_DURABILITY, else fast) for the manifest, and spaces inherit it by default.
Is add_image expected behind untrusted input? It is now hardened for it: payload validation runs before the encode, and oversized/decompression-bomb images are rejected from the header before decode.
Do vector/image text= captions participate in lexical/hybrid when index_text=True? Yes. A caption's tokens are indexed on upsert (and logged in the WAL), so it is found by mode="lexical"/"hybrid"; a captionless vector/image clears the id's postings instead.
Does LodeCollection own the privacy/indexing flags, or must callers restate them? It owns them: store_text/index_text are recorded in the manifest and re-applied on reopen, with a conflicting explicit override rejected.
LodeCollection + custom embedders: supported. A custom space records {kind: custom, model_identity, bit_width, store_text, index_text} and is reopened with a matching embedder=; the identity is re-enforced when the underlying index opens.
Public bit_width: only 2 and 4 are valid (TurboVec). It is fixed by a preset and configurable only on vector_dim= / embedder= indexes, enforced at the SDK boundary and on reopen.
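The manifest answers above (tmp + os.replace, fsync under durability="fsync", and the explicit-argument / LODEDB_DURABILITY / fast resolution order) can be sketched with the standard library alone. This is a minimal illustration, not LodeDB's code: resolve_durability and publish_manifest are hypothetical names, and the manifest shape is assumed.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_durability(explicit: Optional[str] = None) -> str:
    """Explicit argument wins, then LODEDB_DURABILITY, then "fast"."""
    return explicit or os.environ.get("LODEDB_DURABILITY") or "fast"


def publish_manifest(root: Path, manifest: dict, durability: str) -> None:
    """Publish collection.json so readers see the old or the new file, never a partial one."""
    target = root / "collection.json"
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=root, prefix="collection.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(manifest, tmp, sort_keys=True)
            tmp.flush()
            if durability == "fsync":
                # Force the bytes to stable storage before the rename makes them visible.
                os.fsync(tmp.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # A failed publish leaves no stray temp file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The sketch does not fsync the parent directory after the rename; whether the real manifest writer does so is not stated in this PR.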

Out of scope / follow-ups

Late-interaction / visual-document retrieval (ColPali/ColQwen): tracked in Late-interaction (multi-vector / MaxSim) retrieval for visual-document RAG #25.
A first-class chunked/resumable add_images helper (commit_every, progress, resume) for unbounded galleries; today it is one atomic commit per call, with a documented chunked loop.
Fleet-wide image-encode observability (cross-handle/cross-process aggregation, structured encode events); today stats()["image_embedding"] is per-handle.
A redacted metric/log for collection manifest-publish failures (today a failure raises a clear error and rolls back).
CLI image / collection verbs, if multimodal should be drivable from the lodedb CLI.

Testing

uv run ruff check ., uv lock --check
uv run pytest -q (501 passed, 2 skipped)
New round-6 tests: a fault-injected manifest-publish failure rolls the space back and leaves no leaked lock (a fresh collection reopens it); the manifest write uses fsync under durability="fsync"; add_image with invalid metadata does zero encodes.
New round-5 tests: collection persists/enforces store_text/index_text across reopen (the privacy repro); text->vector/image replacement clears stale lexical postings (same handle + reopen); a vector caption is lexically searchable and recovers from an uncheckpointed-WAL crash; image-embedding metrics split into ingest/query.
Earlier rounds: store_text=False text-in crash-replay (vector + lexical recovery, zero raw text); full-identity reopen rejection (vector-only/custom task collision, bit-width mismatch); preset rejects a conflicting bit_width; public embedder= identity; collection custom-space record/reopen and plain-index-open-on-collection-root; oversized-image rejection; Pillow import-boundary laziness.
Smoke-tested the real CLIP path; ran the multimodal benchmark locally (no regression).

Promote the internal _embedding_backend test hook to a supported embedder= parameter. A caller-supplied EngineEmbeddingBackend drives a text-capable index at its own native_dim, with its required_model_name pinned into the snapshot header and re-enforced on reopen. Mutually exclusive with vector_dim.

…earch_by_image)

Add a multimodal preset backed by a sentence-transformers CLIP model that embeds text and images into one shared space, so text->image and image->image search run over the existing single-vector TurboVec scan with no storage or scoring change.

- ClipEmbeddingBackend (engine): lazy sentence-transformers + Pillow, so a plain import lodedb pulls neither; guarded by a new import-boundary test.
- clip-turbovec route profile + 'clip' preset (512-dim, 4-bit), wired through build_local_embedding_backend.
- db.add_image / db.search_by_image: embed an image (path/bytes/PIL) and reuse the vector-in path; raw bytes are never stored (keep the path in metadata).
- [image] extra adds only Pillow (CLIP rides the base sentence-transformers stack).
- doctor reports image-embedding readiness.

Group several independent LodeDB indexes (spaces) under one root directory, each free to use a different model or dimension (e.g. a text space at model='minilm' beside an image space at model='clip'). A collection.json manifest records each space's (model, vector_dim, bit_width) and re-enforces it on reopen. Spaces are searched independently; there is no cross-space scoring, since vectors from different models are not comparable. The engine is unchanged.

The vendored TurboVec build fetches crates.io deps; the Windows runner intermittently fails with '[56] schannel: server closed abruptly (missing close_notify)'. Disable HTTP/2 multiplexing (the trigger) and raise the network retry count so a transient drop no longer fails the build.

… index identity on reopen

Critical: under store_text=False the WAL serialized raw text (caption or document body), violating the no-raw-text-on-disk contract (WAL is the default commit mode).

- Vector-in/image WAL payload drops text when store_text is off; replay rebuilds the row from the vector. Vector-only indexes keep the WAL with no leak.
- Text-in replay re-embeds from the body, so it needs the text: a text-embedding index with store_text=False now resolves durability to generation (which persists compact codes, no raw text); an explicit commit_mode='wal' there is rejected. An engine-level guard backs this for direct callers.

High: loading an index only checked dimension, so a same-dimension different-model backend reopened and served meaningless scores. _validate_loaded_state_identity now enforces persisted (model, native_dim) against the route policy and backend at open.

… and merge manifest writes

- space() now enforces the requested (model, vector_dim, bit_width) before returning an already-open handle, so a mismatched in-process reopen fails immediately.
- manifest writes take a collection-root advisory lock and read-merge-write, so a space another handle created since load is not lost to last-writer-wins.
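A minimal sketch of the lock + read-merge-write pattern described above, using only the standard library. It is an illustration under stated assumptions, not the project's implementation: the exclusive-create lock file (manifest.lock), the function names, and the {"spaces": {...}} manifest shape are all hypothetical stand-ins for the collection-root advisory lock.

```python
import json
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def root_lock(root: Path, timeout: float = 5.0):
    """Hold a simple advisory lock by exclusively creating a lock file."""
    lock_path = root / "manifest.lock"  # hypothetical file name
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not lock {lock_path}")
            time.sleep(0.01)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def record_space(root: Path, name: str, config: dict) -> dict:
    """Re-read the manifest under the lock, merge one space, and publish atomically."""
    manifest_path = root / "collection.json"
    with root_lock(root):
        try:
            current = json.loads(manifest_path.read_text(encoding="utf-8"))
        except FileNotFoundError:
            current = {"spaces": {}}
        # Merging into the freshly read state keeps spaces other handles added.
        current["spaces"][name] = config
        tmp_path = root / "collection.json.tmp"
        tmp_path.write_text(json.dumps(current, sort_keys=True), encoding="utf-8")
        os.replace(tmp_path, manifest_path)
    return current
```

Because each writer re-reads the file while holding the lock, a handle with a stale in-memory view cannot overwrite a space it never saw.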

…r identity is non-secret

README quickstart and architecture.md still listed only minilm/bge. Also document that a custom embedder's required_model_name is persisted in the index header (re-enforced on reopen) and must be a non-secret public identifier.

add_images embeds a whole batch of images in a single embed_images call and stores them in one atomic commit, instead of one encode + commit per add_image. Each item is {image, id?, metadata?, text?}; the per-image storage contract matches add_image (raw bytes never stored). Updates the example and docs to use it.

A text-embedding index with store_text=False now keeps WAL durability instead of falling back to generation. When raw-text storage is off, the WAL logs the ingest's chunk-level delta (added chunks with their embeddings + removed chunk ids + each doc's final chunk list/hash/metadata) rather than the raw body, and a new apply_embedded_documents replay rebuilds the rows on top of the committed base, so the WAL carries no raw text and replay never re-embeds. Removes the generation fallback and the reject guard. Adds a crash-replay regression test (uncheckpointed WAL -> reopen -> identical ranking, zero raw text on disk).
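To make the shape of such a record concrete, here is a small sketch of a WAL payload builder that logs the chunk-level delta instead of the body when raw-text storage is off. The field names, the chunk-id scheme, and the use of a SHA-256 body hash are assumptions for illustration; the PR only states which categories of data the record carries.

```python
import hashlib
import json


def build_wal_record(doc_id, body, chunks, embeddings, removed_chunk_ids, metadata, store_text):
    """Return a JSON-serializable WAL payload for one text ingest."""
    if store_text:
        # Raw-text storage is allowed, so the body itself can be logged.
        return {"op": "add_text", "id": doc_id, "text": body, "metadata": metadata}
    # Hypothetical chunk-id scheme: "<doc id>#<chunk position>".
    chunk_ids = [f"{doc_id}#{i}" for i in range(len(chunks))]
    return {
        "op": "apply_embedded_documents",
        # Added chunks carry only their embeddings, never the chunk text.
        "added": [
            {"chunk_id": cid, "embedding": vec}
            for cid, vec in zip(chunk_ids, embeddings)
        ],
        "removed": list(removed_chunk_ids),
        "docs": {
            doc_id: {
                "chunks": chunk_ids,
                "hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
                "metadata": metadata,
            }
        },
    }
```

Replaying a record of this kind needs no embedder, because the vectors are already in the log.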

…ndexes

Under index_text=True + store_text=False, apply_embedded_documents replay rebuilt chunks/vectors but not state.document_tokens, so lexical/hybrid search silently missed recovered docs after a crash. Log per-doc token lists in the embedded WAL payload (payload-derived terms the lexical index already persists, not raw text) and restore them on replay. Adds an uncheckpointed-WAL crash test for index_text=True, store_text=False + lexical search.

…images memory

- A public embedder= must declare a non-empty required_model_name (pinned in the header, re-enforced on reopen), so a same-dimension different-model backend is rejected rather than silently scored. Identity-free fixtures stay on the internal _embedding_backend hook.
- add_images decodes/encodes in backend-sized batches instead of loading the whole gallery at once, bounding peak memory; still one atomic commit.

space(name) compared fresh defaults (minilm/None/4) against the manifest, so a recorded vector-only or clip space failed to reopen with no config args. Use sentinel defaults: the recorded config is the reopen default, and only an explicit conflicting override raises.
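The sentinel-default technique named above is easy to show in isolation. The sketch below assumes a recorded config dict with model / vector_dim / bit_width keys; resolve_space_config and _UNSET are hypothetical names, not the SDK's.

```python
_UNSET = object()  # distinguishes "argument not passed" from an explicit None


def resolve_space_config(recorded, model=_UNSET, vector_dim=_UNSET, bit_width=_UNSET):
    """Use the recorded config as the default; reject only explicit conflicts."""
    requested = {"model": model, "vector_dim": vector_dim, "bit_width": bit_width}
    resolved = dict(recorded)
    for key, value in requested.items():
        if value is _UNSET:
            continue  # caller said nothing, so the manifest value stands
        if key in recorded and recorded[key] != value:
            raise ValueError(
                f"{key}={value!r} conflicts with recorded {key}={recorded[key]!r}"
            )
        resolved[key] = value
    return resolved
```

With ordinary defaults such as model="minilm", a no-argument reopen of a clip space would look like a conflicting request; the sentinel removes that ambiguity.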

…idate bit_width

_validate_loaded_state_identity now checks model, provider, task, native_dim, storage profile, and TurboVec bit width against the route policy, not just model+dim. This rejects a vector-only store reopened as a custom-embedder route (same model id, but task differs) and a reopen at a different bit width. Public bit_width is validated to {2, 4} at construction. Also narrows the add_images memory claim to decoded-image batches (vectors still accumulate for one commit).
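A hedged sketch of a full-identity check of this kind: compare every persisted field against the expected route, and validate bit_width up front. RouteIdentity, validate_identity, and the example field values are hypothetical; only the list of compared fields and the {2, 4} rule come from the commit message.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RouteIdentity:
    model: str
    provider: str
    task: str
    native_dim: int
    storage_profile: str
    bit_width: int


def validate_bit_width(bit_width: int) -> int:
    """Only 2-bit and 4-bit widths are accepted."""
    if bit_width not in (2, 4):
        raise ValueError(f"bit_width must be 2 or 4, got {bit_width!r}")
    return bit_width


def validate_identity(persisted: RouteIdentity, expected: RouteIdentity) -> None:
    """Raise if any identity field differs, naming every mismatched field."""
    mismatched = [
        f.name
        for f in fields(RouteIdentity)
        if getattr(persisted, f.name) != getattr(expected, f.name)
    ]
    if mismatched:
        raise ValueError("index identity mismatch on: " + ", ".join(mismatched))
```

Comparing the whole tuple is what catches the case where model and dimension match but the task or bit width does not.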

… spaces

space() now records an explicit kind (preset/vector/custom) instead of a false 'minilm' for custom spaces. A custom space records its embedder identity and must be reopened with a matching embedder= (clear error otherwise; the identity is re-enforced when the index opens). Preset/vector spaces still reopen from the manifest with no args.

…esh architecture surface

ClipEmbeddingBackend._load_image now opens images under 'with' so file/stream handles close promptly. architecture.md drops the stale 'four sidecars' line and lists the [image] extra / CLIP / LodeCollection / .tvlex in the storage and dependency sections.

…dows CI)

_files_containing read .lodedb.lock, which the live writer holds byte-locked on Windows, raising PermissionError. Skip the lock sentinel by name (it carries no payload) and tolerate PermissionError; the WAL and data sidecars are opened/closed per write so they stay in scope.
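A small sketch of a scan helper with the two behaviours described: skip the lock sentinel by name and tolerate PermissionError on any other locked file. The function below is an illustrative reconstruction, not the test suite's actual _files_containing.

```python
from pathlib import Path

LOCK_SENTINEL = ".lodedb.lock"  # carries no payload, so it is safe to skip


def files_containing(root: Path, needle: bytes) -> list[str]:
    """Return the names of files under root whose bytes contain needle."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.name == LOCK_SENTINEL:
            continue
        try:
            data = path.read_bytes()
        except PermissionError:
            # A file byte-locked by a live writer cannot be read; skip it.
            continue
        if needle in data:
            hits.append(path.name)
    return hits
```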

…e guard + metrics

- bit_width is now fixed by a preset (reject a conflicting explicit value) and only configurable on vector_dim= / embedder= indexes; LodeCollection records a preset space's effective width, never a caller value that would not take effect.
- The legacy snapshot loader skips collection.json, so opening a collection root as a plain index no longer fails with a schema-version error.
- ClipEmbeddingBackend guards against oversized / decompression-bomb images (LODEDB_MAX_IMAGE_PIXELS, ~64 MP default), checking the header before full decode.
- add_images validates every item before embedding, so a bad item never wastes a CLIP batch; stats() reports per-handle image-embedding metrics (count, time, failures).
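The header-before-decode idea can be illustrated without Pillow by reading the dimensions straight out of a PNG's IHDR chunk. This is a stdlib-only sketch of the guard's logic, not ClipEmbeddingBackend's code: the function names are hypothetical, the exact default (64,000,000) is an assumption standing in for "~64 MP", and only PNG is handled.

```python
import os
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
DEFAULT_MAX_PIXELS = 64_000_000  # assumed value for the "~64 MP" default


def max_image_pixels() -> int:
    """Read the limit from LODEDB_MAX_IMAGE_PIXELS, falling back to the default."""
    return int(os.environ.get("LODEDB_MAX_IMAGE_PIXELS", DEFAULT_MAX_PIXELS))


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Parse width and height from the IHDR chunk without decoding any pixels."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def check_pixel_budget(data: bytes) -> None:
    """Reject an image whose declared pixel count exceeds the configured limit."""
    width, height = png_dimensions(data)
    limit = max_image_pixels()
    if width * height > limit:
        raise ValueError(f"image has {width * height} pixels, limit is {limit}")
```

Only the first 24 bytes are inspected, so a tiny file that declares enormous dimensions is rejected before any decompression happens.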

A space's privacy/indexing flags are recorded in collection.json and re-applied on reopen, so col.space("name") restores the exact configuration it was created with; a store_text=False space never silently flips back to retaining raw text. An explicit flag that conflicts with the recorded one is rejected.

Replacing a text document with add_vectors/add_image(id=same) left stale tokens in state.document_tokens, so mode='lexical'/'hybrid' kept matching the old body. A vector document now sets its tokens from the text= caption (searchable when index_text=True, per the SDK contract) or to an empty list when there is no caption, which clears the stale postings and journals the clear into .tvlex. The vector WAL payload logs the caption tokens (payload-derived, not raw text) so a store_text=False crash-replay recovers them too.
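The stale-postings fix can be pictured with a toy inverted index: setting a document's tokens first removes its old postings, so an empty token list clears it from lexical search. LexicalIndex, upsert_vector, and the whitespace tokenizer are hypothetical simplifications; journaling to .tvlex and the WAL is out of scope here.

```python
from collections import defaultdict


class LexicalIndex:
    """Toy inverted index keyed by term, with per-document token lists."""

    def __init__(self):
        self.document_tokens = {}
        self.postings = defaultdict(set)

    def set_tokens(self, doc_id, tokens):
        # Drop the postings of whatever the document held before.
        for term in self.document_tokens.get(doc_id, []):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]
        self.document_tokens[doc_id] = list(tokens)
        for term in tokens:
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term, ()))


def upsert_vector(index, doc_id, caption=None):
    """A vector/image upsert: caption tokens if present, else an empty list."""
    tokens = caption.lower().split() if caption else []
    index.set_tokens(doc_id, tokens)
```

The key point is that the captionless case still calls set_tokens, which is what removes the old body's terms.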

…earch_by_image

search_by_image now goes through the tracked-encode helper, and stats()['image_embedding'] splits counters into 'ingest' (add_image/add_images) and 'query' (search_by_image), so query-time CLIP encode latency and failures are observable, not just ingest.
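One way a tracked-encode helper with per-phase counters could look is a context manager that wraps each encode. The class name and the count / seconds / failures keys are assumptions for illustration; the actual layout of stats()['image_embedding'] is not shown in this PR.

```python
import time
from contextlib import contextmanager


class ImageEmbeddingMetrics:
    """Per-handle encode counters split into 'ingest' and 'query' phases."""

    def __init__(self):
        self._phases = {
            phase: {"count": 0, "seconds": 0.0, "failures": 0}
            for phase in ("ingest", "query")
        }

    @contextmanager
    def track(self, phase):
        bucket = self._phases[phase]
        start = time.perf_counter()
        try:
            yield
        except Exception:
            bucket["failures"] += 1
            raise
        else:
            bucket["count"] += 1
        finally:
            # Time is recorded for successful and failed encodes alike.
            bucket["seconds"] += time.perf_counter() - start

    def snapshot(self):
        """Return a copy that holds numbers only, no paths or captions."""
        return {phase: dict(bucket) for phase, bucket in self._phases.items()}
```

An add_image path would wrap its encode in track("ingest") and search_by_image in track("query"), so both show up in the same snapshot.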

…sync)

A failed collection.json publish now rolls the space back (closes the child, releasing its writer lock, and restores the registry) instead of leaving an open, unregistered, locked space. The manifest honors durability='fsync' (and LODEDB_DURABILITY) so a durable space is never left invisible after a power loss; spaces inherit the collection's durability by default.
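The rollback behaviour can be sketched with injected dependencies, which is also roughly how a fault-injection test would drive it. Space, Collection, and the open_space / publish callables are hypothetical stand-ins, not LodeCollection's real structure.

```python
class Space:
    """Stand-in for an open child index that holds a writer lock until closed."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


class Collection:
    def __init__(self, open_space, publish):
        self._open_space = open_space  # callable: name -> Space
        self._publish = publish        # callable: registry dict -> None, may raise
        self.registry = {}
        self.open_spaces = {}

    def create_space(self, name, config):
        previous = dict(self.registry)
        space = self._open_space(name)
        self.open_spaces[name] = space
        self.registry[name] = config
        try:
            self._publish(self.registry)
        except Exception:
            # Undo everything: close the child (releasing its lock),
            # forget the handle, and restore the prior registry.
            space.close()
            self.open_spaces.pop(name, None)
            self.registry = previous
            raise
        return space
```

After a failed publish the collection is in the same state as before the call, so a fresh attempt can open the space again.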

…stion + metrics scope

Document <key>.wal in the payload-boundary section (architecture.md + README): what it holds by store_text/index_text mode, that persist()/close() checkpoint+truncate it, that read-only handles never read it, and that generation mode keeps no WAL payload, so operators classify it for backup/support/incident handling. Document add_images as an atomic batch (not streaming) with a recommended chunked loop for large galleries, and note that stats() image-embedding counters are per-handle and reset on reopen.
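The recommended chunked loop might look like the sketch below: one add_images call, and therefore one atomic commit, per chunk. The helper names and the chunk size of 256 are assumptions; db is any object exposing add_images(list_of_items), with items shaped like the {image, id?, metadata?, text?} dicts described in this PR.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most size items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def add_gallery(db, items: Iterable[dict], chunk_size: int = 256) -> int:
    """Ingest a large gallery as a series of independent atomic commits."""
    total = 0
    for chunk in chunked(items, chunk_size):
        db.add_images(chunk)  # each call commits its own chunk
        total += len(chunk)   # a natural resume point after every commit
    return total
```

If the process stops midway, every chunk committed so far is durable, and ingestion can resume from the running total.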

Davidobot force-pushed the feat/multimodal-embedding branch from 6cd2f91 to 83d24a6 on June 25, 2026 03:49

Davidobot added 29 commits June 25, 2026 08:41

docs: document multimodal (CLIP) and bring-your-own-vector search

1884593

Add docs/multimodal.md and examples/multimodal_clip.py, a README section, and the clip preset to the quickstart preset list.

bench: add image-vector storage benchmark (LodeDB vs Chroma/Qdrant)

1c2b87c

Feeds every store the same precomputed CLIP-dimension vectors and reports ingest, on-disk footprint, query latency, and recall@k against the exact brute-force top-k. Competitors are optional (guarded imports).

docs: clarify add_image storage (vector + metadata; text= caption gat…

90123cd

…ed by store_text)

docs: document the reference-by-metadata pattern and portable bundle …

378c6d8

…recipe for images

build: refresh uv.lock for the image extra; CI checks lockfile and in…

9876b54

…stalls image extra - uv.lock now records the [image] extra (uv lock --check was failing). - CI asserts uv lock --check and syncs --extra image so the optional extra is validated across the OS matrix.

docs: surface multimodal at the top of the README

1623ce1

Add multimodal to the tagline and a Multimodal feature bullet linking to the existing section.

README copy

041ecb1

docs: changelog entry + README image extra; export ImageEmbeddingUnsu…

2fd7c18

…pportedError

docs: changelog for multimodal collection/lexical/metrics hardening

8b337c1

Davidobot force-pushed the feat/multimodal-embedding branch from 05f30b9 to 8b337c1 on June 25, 2026 15:59

Davidobot added 4 commits June 25, 2026 09:21

fix(local): validate add_image payload before embedding

5fcf11c

Coerce metadata/text before the CLIP encode (mirroring add_images), so a bad request wastes no encode and does not increment ingest metrics.

docs: embedder= required_model_name is required; note collection cras…

aa02546

…h-safety

Davidobot merged commit d4c6c2d into main Jun 25, 2026
4 checks passed

Davidobot deleted the feat/multimodal-embedding branch June 25, 2026 16:45

Davidobot merged 33 commits into main from feat/multimodal-embedding
