Multimodal embedding: CLIP image+text preset, custom embedders, named vector spaces #26
Merged
Promote the internal _embedding_backend test hook to a supported embedder= parameter. A caller-supplied EngineEmbeddingBackend drives a text-capable index at its own native_dim, with its required_model_name pinned into the snapshot header and re-enforced on reopen. Mutually exclusive with vector_dim.
…earch_by_image)
Add a multimodal preset backed by a sentence-transformers CLIP model that embeds text and images into one shared space, so text->image and image->image search run over the existing single-vector TurboVec scan with no storage or scoring change.
- ClipEmbeddingBackend (engine): lazy sentence-transformers + Pillow, so a plain import lodedb pulls neither; guarded by a new import-boundary test.
- clip-turbovec route profile + 'clip' preset (512-dim, 4-bit), wired through build_local_embedding_backend.
- db.add_image / db.search_by_image: embed an image (path/bytes/PIL) and reuse the vector-in path; raw bytes are never stored (keep the path in metadata).
- [image] extra adds only Pillow (CLIP rides the base sentence-transformers stack).
- doctor reports image-embedding readiness.
Group several independent LodeDB indexes (spaces) under one root directory, each free to use a different model or dimension (e.g. a text space at model='minilm' beside an image space at model='clip'). A collection.json manifest records each space's (model, vector_dim, bit_width) and re-enforces it on reopen. Spaces are searched independently; there is no cross-space scoring, since vectors from different models are not comparable. The engine is unchanged.
Add docs/multimodal.md and examples/multimodal_clip.py, a README section, and the clip preset to the quickstart preset list.
Feeds every store the same precomputed CLIP-dimension vectors and reports ingest, on-disk footprint, query latency, and recall@k against the exact brute-force top-k. Competitors are optional (guarded imports).
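To make the recall@k metric concrete, here is a minimal sketch of how a store's top-k can be scored against the exact brute-force top-k. It is not the benchmark's actual code: the function names, the random 512-dim vectors, and the simulated "store" result are all assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def exact_top_k(query, corpus, k):
    """Brute-force ground truth: ids of the k highest cosine scores."""
    scored = sorted(corpus.items(), key=lambda item: dot(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


def recall_at_k(retrieved_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the retrieved top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(retrieved_ids[:k])) / len(truth)


# Synthetic stand-in for precomputed CLIP-dimension vectors (random, not real embeddings).
rng = random.Random(0)
corpus = {f"img-{i}": normalize([rng.gauss(0, 1) for _ in range(512)]) for i in range(200)}
query = normalize([rng.gauss(0, 1) for _ in range(512)])

truth = exact_top_k(query, corpus, k=10)
# Hypothetical store result that misses the last two true neighbours.
approx = truth[:8] + ["img-miss-a", "img-miss-b"]
print(recall_at_k(approx, truth, k=10))  # 0.8
```

Because every store is fed the same vectors, the ground truth only has to be computed once per query and can be reused for each store under test.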
…ed by store_text)
…recipe for images
The vendored TurboVec build fetches crates.io deps; the Windows runner intermittently fails with '[56] schannel: server closed abruptly (missing close_notify)'. Disable HTTP/2 multiplexing (the trigger) and raise the network retry count so a transient drop no longer fails the build.
… index identity on reopen
Critical: under store_text=False the WAL serialized raw text (caption or document body), violating the no-raw-text-on-disk contract (WAL is the default commit mode).
- Vector-in/image WAL payload drops text when store_text is off; replay rebuilds the row from the vector. Vector-only indexes keep the WAL with no leak.
- Text-in replay re-embeds from the body, so it needs the text: a text-embedding index with store_text=False now resolves durability to generation (which persists compact codes, no raw text); an explicit commit_mode='wal' there is rejected. An engine-level guard backs this for direct callers.
High: loading an index only checked dimension, so a same-dimension different-model backend reopened and served meaningless scores. _validate_loaded_state_identity now enforces persisted (model, native_dim) against the route policy and backend at open.
… and merge manifest writes
- space() now enforces the requested (model, vector_dim, bit_width) before returning an already-open handle, so a mismatched in-process reopen fails immediately.
- manifest writes take a collection-root advisory lock and read-merge-write, so a space another handle created since load is not lost to last-writer-wins.
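The read-merge-write pattern for the manifest can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `{"spaces": {...}}` layout, the `collection.lock` sentinel name, and the exclusive-create lock (a crude stand-in for a real advisory lock) are all hypothetical. The temp-file plus `os.replace` publish mirrors the crash-atomic commit the PR description mentions.

```python
import json
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def root_lock(root, timeout=5.0):
    """Crude lock via exclusive file creation (stand-in for an advisory lock)."""
    path = os.path.join(root, "collection.lock")  # hypothetical sentinel name
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not lock {root}")
            time.sleep(0.01)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(path)


def publish_space(root, name, config, fsync=False):
    """Read-merge-write collection.json so spaces added by other handles survive."""
    manifest_path = os.path.join(root, "collection.json")
    with root_lock(root):
        try:
            with open(manifest_path, "r", encoding="utf-8") as fh:
                manifest = json.load(fh)  # re-read under the lock, never a stale copy
        except FileNotFoundError:
            manifest = {"spaces": {}}
        recorded = manifest["spaces"].get(name)
        if recorded is not None and recorded != config:
            raise ValueError(f"space {name!r} already recorded as {recorded}")
        manifest["spaces"][name] = config
        # Write a temp file in the same directory, then atomically swap it in.
        fd, tmp_path = tempfile.mkstemp(dir=root, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(manifest, fh, indent=2, sort_keys=True)
                if fsync:
                    fh.flush()
                    os.fsync(fh.fileno())
            os.replace(tmp_path, manifest_path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
            raise
    return manifest


root = tempfile.mkdtemp()
publish_space(root, "text", {"model": "minilm", "bit_width": 4})
manifest = publish_space(root, "images", {"model": "clip", "bit_width": 4})
print(sorted(manifest["spaces"]))  # ['images', 'text']
```

The second call re-reads the file under the lock, so the "text" entry written by the first call is merged rather than overwritten.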
…stalls image extra
- uv.lock now records the [image] extra (uv lock --check was failing).
- CI asserts uv lock --check and syncs --extra image so the optional extra is validated across the OS matrix.
…r identity is non-secret
README quickstart and architecture.md still listed only minilm/bge. Also document that a custom embedder's required_model_name is persisted in the index header (re-enforced on reopen) and must be a non-secret public identifier.
add_images embeds a whole batch of images in a single embed_images call and stores
them in one atomic commit, instead of one encode + commit per add_image. Each item
is {image, id?, metadata?, text?}; the per-image storage contract matches add_image
(raw bytes never stored). Updates the example and docs to use it.
Add multimodal to the tagline and a Multimodal feature bullet linking to the existing section.
A text-embedding index with store_text=False now keeps WAL durability instead of falling back to generation. When raw-text storage is off, the WAL logs the ingest's chunk-level delta (added chunks with their embeddings + removed chunk ids + each doc's final chunk list/hash/metadata) rather than the raw body, and a new apply_embedded_documents replay rebuilds the rows on top of the committed base, so the WAL carries no raw text and replay never re-embeds. Removes the generation fallback and the reject guard. Adds a crash-replay regression test (uncheckpointed WAL -> reopen -> identical ranking, zero raw text on disk).
…ndexes
Under index_text=True + store_text=False, apply_embedded_documents replay rebuilt chunks/vectors but not state.document_tokens, so lexical/hybrid search silently missed recovered docs after a crash. Log per-doc token lists in the embedded WAL payload (payload-derived terms the lexical index already persists, not raw text) and restore them on replay. Adds an uncheckpointed-WAL crash test for index_text=True, store_text=False + lexical search.
…images memory
- A public embedder= must declare a non-empty required_model_name (pinned in the header, re-enforced on reopen), so a same-dimension different-model backend is rejected rather than silently scored. Identity-free fixtures stay on the internal _embedding_backend hook.
- add_images decodes/encodes in backend-sized batches instead of loading the whole gallery at once, bounding peak memory; still one atomic commit.
space(name) compared fresh defaults (minilm/None/4) against the manifest, so a recorded vector-only or clip space failed to reopen with no config args. Use sentinel defaults: the recorded config is the reopen default, and only an explicit conflicting override raises.
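The sentinel-default approach can be illustrated with a small sketch. The helper name `resolve_space_config` and the recorded values are hypothetical; only the behavior (recorded config is the reopen default, an explicit conflicting override raises) comes from the commit message.

```python
_UNSET = object()  # distinguishes "argument not passed" from an explicit None


def resolve_space_config(recorded, model=_UNSET, vector_dim=_UNSET, bit_width=_UNSET):
    """Reopen an existing space: recorded values win unless explicitly contradicted."""
    requested = {"model": model, "vector_dim": vector_dim, "bit_width": bit_width}
    for key, value in requested.items():
        if value is _UNSET:
            continue  # caller said nothing, so the manifest's value stands
        if value != recorded[key]:
            raise ValueError(
                f"{key}={value!r} conflicts with recorded {key}={recorded[key]!r}"
            )
    return dict(recorded)


# Illustrative record for a vector-only space (values are made up).
recorded = {"model": None, "vector_dim": 384, "bit_width": 2}

print(resolve_space_config(recorded))                  # reopen with no args
print(resolve_space_config(recorded, vector_dim=384))  # matching override is fine
try:
    resolve_space_config(recorded, bit_width=4)        # explicit conflict
except ValueError as exc:
    print("rejected:", exc)
```

A plain default such as `bit_width=4` cannot express "the caller did not pass this", which is why comparing fresh defaults against the manifest produced false conflicts.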
…idate bit_width
_validate_loaded_state_identity now checks model, provider, task, native_dim, storage
profile, and TurboVec bit width against the route policy, not just model+dim. This
rejects a vector-only store reopened as a custom-embedder route (same model id, but
task differs) and a reopen at a different bit width. Public bit_width is validated to
{2, 4} at construction. Also narrows the add_images memory claim to decoded-image
batches (vectors still accumulate for one commit).
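As a rough sketch of the full-identity check described above (not the actual _validate_loaded_state_identity code), the persisted and expected identities can be compared field by field. The dataclass, the helper names, and the example provider/task strings are assumptions; the field list and the {2, 4} constraint come from the commit message.

```python
from dataclasses import dataclass, fields

VALID_BIT_WIDTHS = (2, 4)


@dataclass(frozen=True)
class IndexIdentity:
    model: str
    provider: str
    task: str
    native_dim: int
    storage_profile: str
    bit_width: int


def validate_bit_width(bit_width):
    """Reject unsupported widths at construction time."""
    if bit_width not in VALID_BIT_WIDTHS:
        raise ValueError(f"bit_width must be one of {VALID_BIT_WIDTHS}, got {bit_width!r}")
    return bit_width


def validate_identity(persisted, expected):
    """Raise if any identity field differs, naming every mismatched field."""
    mismatched = [
        f.name
        for f in fields(IndexIdentity)
        if getattr(persisted, f.name) != getattr(expected, f.name)
    ]
    if mismatched:
        raise ValueError("index identity mismatch on: " + ", ".join(mismatched))


# Hypothetical values: same model id and dimension, but the task differs.
persisted = IndexIdentity("my-model", "local", "vector-only", 384, "turbovec", 4)
reopened = IndexIdentity("my-model", "local", "text-embedding", 384, "turbovec", 4)

try:
    validate_identity(persisted, reopened)
except ValueError as exc:
    print(exc)  # index identity mismatch on: task
```

Checking only model and dimension would have accepted this pair, which is the collision the commit closes.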
… spaces
space() now records an explicit kind (preset/vector/custom) instead of a false 'minilm' for custom spaces. A custom space records its embedder identity and must be reopened with a matching embedder= (clear error otherwise; the identity is re-enforced when the index opens). Preset/vector spaces still reopen from the manifest with no args.
…esh architecture surface
ClipEmbeddingBackend._load_image now opens images under 'with' so file/stream handles close promptly. architecture.md drops the stale 'four sidecars' line and lists the [image] extra / CLIP / LodeCollection / .tvlex in the storage and dependency sections.
…dows CI)
_files_containing read .lodedb.lock, which the live writer holds byte-locked on Windows, raising PermissionError. Skip the lock sentinel by name (it carries no payload) and tolerate PermissionError; the WAL and data sidecars are opened/closed per write so they stay in scope.
…e guard + metrics
- bit_width is now fixed by a preset (reject a conflicting explicit value) and only configurable on vector_dim= / embedder= indexes; LodeCollection records a preset space's effective width, never a caller value that would not take effect.
- The legacy snapshot loader skips collection.json, so opening a collection root as a plain index no longer fails with a schema-version error.
- ClipEmbeddingBackend guards against oversized / decompression-bomb images (LODEDB_MAX_IMAGE_PIXELS, ~64 MP default), checking the header before full decode.
- add_images validates every item before embedding, so a bad item never wastes a CLIP batch; stats() reports per-handle image-embedding metrics (count, time, failures).
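The idea of rejecting an oversized image from its header, before any pixel data is decoded, can be sketched with the standard library alone. This is not how ClipEmbeddingBackend does it (the PR uses Pillow); the hand-rolled PNG IHDR parser, the helper names, the reading of "~64 MP" as 64,000,000 pixels, and the assumption that LODEDB_MAX_IMAGE_PIXELS holds a plain integer are all illustrative.

```python
import os
import struct

DEFAULT_MAX_PIXELS = 64_000_000  # assumption: "~64 MP" taken as 64 million pixels
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def max_image_pixels(environ=None):
    """Read the limit from the environment, falling back to the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get("LODEDB_MAX_IMAGE_PIXELS")
    return int(raw) if raw else DEFAULT_MAX_PIXELS


def png_dimensions(data):
    """Read width/height from the IHDR chunk without decoding pixel data."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def check_image_size(data, limit=None):
    """Raise before any decode if the declared pixel count exceeds the limit."""
    limit = max_image_pixels() if limit is None else limit
    width, height = png_dimensions(data)
    if width * height > limit:
        raise ValueError(f"image is {width}x{height} px, over the {limit} px limit")
    return width, height


def fake_png_header(width, height):
    """Signature + IHDR prefix only: enough to exercise the header check."""
    return PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", width, height)


print(check_image_size(fake_png_header(4000, 3000), limit=DEFAULT_MAX_PIXELS))
try:
    check_image_size(fake_png_header(20000, 20000), limit=DEFAULT_MAX_PIXELS)
except ValueError as exc:
    print("rejected:", exc)
```

The check costs a 24-byte read, so a file that declares an enormous canvas is turned away before it can force a large allocation.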
A space's privacy/indexing flags are recorded in collection.json and re-applied on
reopen, so col.space("name") restores the exact configuration it was created with;
a store_text=False space never silently flips back to retaining raw text. An explicit
flag that conflicts with the recorded one is rejected.
Replacing a text document with add_vectors/add_image(id=same) left stale tokens in state.document_tokens, so mode='lexical'/'hybrid' kept matching the old body. A vector document now sets its tokens from the text= caption (searchable when index_text=True, per the SDK contract) or to an empty list when there is no caption, which clears the stale postings and journals the clear into .tvlex. The vector WAL payload logs the caption tokens (payload-derived, not raw text) so a store_text=False crash-replay recovers them too.
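The stale-postings fix can be pictured with a toy inverted index. This is a simplified sketch, not the engine's lexical index: the class, the whitespace tokenizer, and the `upsert_vector_doc` helper are hypothetical. The point it shows is that replacing a document's token list, even with an empty list, must remove its old postings.

```python
from collections import defaultdict


class ToyLexicalIndex:
    def __init__(self):
        self.document_tokens = {}         # doc id -> token list
        self.postings = defaultdict(set)  # term -> doc ids

    def set_tokens(self, doc_id, tokens):
        """Replace a document's tokens and drop postings for its old terms."""
        for term in set(self.document_tokens.get(doc_id, ())):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]
        self.document_tokens[doc_id] = list(tokens)
        for term in tokens:
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term, ()))


def upsert_vector_doc(index, doc_id, caption=None, index_text=True):
    """Vector/image upsert: tokens come from the caption, else an empty list."""
    tokens = caption.lower().split() if (caption and index_text) else []
    index.set_tokens(doc_id, tokens)


index = ToyLexicalIndex()
index.set_tokens("doc-1", "quarterly revenue report".split())  # original text doc
print(index.search("revenue"))   # ['doc-1']

upsert_vector_doc(index, "doc-1", caption="Sunset over harbor")  # replaced by an image
print(index.search("revenue"))   # []
print(index.search("sunset"))    # ['doc-1']

upsert_vector_doc(index, "doc-1")  # captionless replacement clears everything
print(index.search("sunset"))    # []
```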
…earch_by_image
search_by_image now goes through the tracked-encode helper, and stats()['image_embedding'] splits counters into 'ingest' (add_image/add_images) and 'query' (search_by_image), so query-time CLIP encode latency and failures are observable, not just ingest.
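A tracked-encode helper that splits counters by phase could look roughly like this. The class name, the `track` context manager, and the counter key names (`count`, `seconds`, `failures`) are assumptions; the source only states that count, time, and failures are reported per handle and split into ingest and query.

```python
import time
from contextlib import contextmanager


class ImageEmbeddingMetrics:
    """Per-handle counters; a new handle starts from zero."""

    def __init__(self):
        self._phases = {
            phase: {"count": 0, "seconds": 0.0, "failures": 0}
            for phase in ("ingest", "query")
        }

    @contextmanager
    def track(self, phase, n_images=1):
        bucket = self._phases[phase]
        start = time.perf_counter()
        try:
            yield
        except Exception:
            bucket["failures"] += 1  # failed encodes are counted, then re-raised
            raise
        else:
            bucket["count"] += n_images
        finally:
            bucket["seconds"] += time.perf_counter() - start

    def snapshot(self):
        """Copy of the counters; holds no paths or captions."""
        return {phase: dict(values) for phase, values in self._phases.items()}


metrics = ImageEmbeddingMetrics()

with metrics.track("ingest", n_images=3):
    pass  # a batch encode would run here

try:
    with metrics.track("query"):
        raise RuntimeError("simulated encode failure")
except RuntimeError:
    pass

stats = metrics.snapshot()
print(stats["ingest"]["count"], stats["query"]["failures"])  # 3 1
```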
…sync)
A failed collection.json publish now rolls the space back (closes the child, releasing its writer lock, and restores the registry) instead of leaving an open, unregistered, locked space. The manifest honors durability='fsync' (and LODEDB_DURABILITY) so a durable space is never left invisible after a power loss; spaces inherit the collection's durability by default.
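The durability precedence (explicit durability= argument, else LODEDB_DURABILITY, else fast) and the inherit-by-default rule for spaces can be sketched as below. The function names are hypothetical, and treating "fast" and "fsync" as the only accepted values is an assumption based on the two values that appear in this PR.

```python
import os

KNOWN_DURABILITY = ("fast", "fsync")  # the two values named in the PR text


def resolve_durability(explicit=None, environ=None):
    """Explicit argument wins, then LODEDB_DURABILITY, then the 'fast' default."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        value = explicit
    else:
        value = environ.get("LODEDB_DURABILITY") or "fast"
    if value not in KNOWN_DURABILITY:
        raise ValueError(f"unknown durability {value!r}; expected one of {KNOWN_DURABILITY}")
    return value


def space_durability(collection_durability, space_override=None):
    """A space inherits the collection's durability unless it sets its own."""
    if space_override is None:
        return collection_durability
    return resolve_durability(space_override, environ={})


print(resolve_durability(environ={}))                                      # fast
print(resolve_durability(environ={"LODEDB_DURABILITY": "fsync"}))          # fsync
print(resolve_durability("fast", environ={"LODEDB_DURABILITY": "fsync"}))  # fast
print(space_durability("fsync"))                                           # fsync
```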
Coerce metadata/text before the CLIP encode (mirroring add_images), so a bad request wastes no encode and does not increment ingest metrics.
…stion + metrics scope
Document <key>.wal in the payload-boundary section (architecture.md + README): what it holds by store_text/index_text mode, that persist()/close() checkpoint+truncate it, that read-only handles never read it, and that generation mode keeps no WAL payload, so operators classify it for backup/support/incident handling. Document add_images as an atomic batch (not streaming) with a recommended chunked loop for large galleries, and note that stats() image-embedding counters are per-handle and reset on reopen.
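A chunked loop of the kind recommended for large galleries might look like the sketch below. It is not taken from the docs: `ingest_gallery`, `chunked`, the chunk size, and the `on_chunk` progress hook are hypothetical, and the `add_images` callable is passed in so the example runs without LodeDB (in real use it would be db.add_images). The item shape follows the {image, id?, metadata?, text?} contract described earlier in this PR.

```python
from itertools import islice


def chunked(items, size):
    """Yield lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def ingest_gallery(items, add_images, chunk_size=256, on_chunk=None):
    """Call add_images once per chunk: one atomic commit per chunk, not per gallery."""
    total = 0
    for index, chunk in enumerate(chunked(items, chunk_size)):
        add_images(chunk)  # a crash loses at most the chunk in flight
        total += len(chunk)
        if on_chunk is not None:
            on_chunk(index, total)  # natural resume point
    return total


# Stand-in for db.add_images so the sketch is runnable on its own.
committed_batches = []


def fake_add_images(batch):
    committed_batches.append(len(batch))


gallery = (
    {"image": f"photos/{i}.png", "id": f"img-{i}", "metadata": {"path": f"photos/{i}.png"}}
    for i in range(10)
)
print(ingest_gallery(gallery, fake_add_images, chunk_size=4))  # 10
print(committed_batches)                                       # [4, 4, 2]
```

Because the gallery is consumed lazily, only one chunk of items is materialized at a time, and stable ids make a rerun after a failure an upsert rather than a duplicate.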
Summary
- clip preset: a sentence-transformers CLIP backend embeds images and text into one shared space, so db.add_image() / db.add_images() and cross-modal text-to-image / image-to-image search run over the existing single-vector TurboVec scan with no storage or scoring change. CLIP rides the sentence-transformers stack (separate from the ONNX-default text runtime); the [image] extra adds only Pillow, both lazy-imported. db.add_images([...]) embeds a gallery in backend-sized batches and commits once.
- A public embedder= argument so any EngineEmbeddingBackend can drive a text-capable index at its own dimension (it must declare a non-secret required_model_name).
- LodeCollection: named vector spaces (sibling indexes) under one root, each recorded with an explicit kind and reopened from a manifest.
- [image] extra, docs/multimodal.md, examples/multimodal_clip.py.
- benchmarks/multimodal_image/; doctor reports image-embedding readiness.

Raw image bytes are never stored by LodeDB (keep the path in metadata); the on-disk format is unchanged.
Correctness and privacy (addresses all review rounds)
- Rebased on current main (includes #32's runtime-default hardening; the WAL-default constructor docs now read correctly).
- store_text=False keeps raw text off disk, including the WAL. Vector/image WAL records drop the caption; text-in records log the chunk embedding delta (and, with index_text=True, the per-chunk lexical tokens) instead of the body, and apply_embedded_documents replay rebuilds rows on the committed base, so store_text=False text indexes keep full WAL durability and mode="lexical"/"hybrid" recover after a crash, all with no raw text written.
- Replacing a text document with add_vectors/add_image at the same id used to leave the old body's terms in the lexical index. The upsert now sets the document's tokens from the text= caption (searchable when index_text=True) or to an empty list when there is no caption, clearing the stale postings; the change is journaled to .tvlex and the caption tokens are logged in the WAL, so it holds in the live handle and after a crash-replay.
- Reopen checks (model, provider, task, native_dim, storage profile, bit width) against the route policy, so a vector-only store cannot reopen as a custom-embedder route (same model id, different task) and a bit-width change is not silently ignored. A public embedder= must declare a non-empty required_model_name.
- bit_width. A preset's width is fixed by its route, so an explicit conflicting bit_width for a preset is now rejected rather than silently ignored; bit_width is configurable only on vector_dim= / embedder= indexes, where it must be 2 or 4. LodeCollection records a preset space's effective width, never a caller value that would not take effect.
- LodeCollection is a crash-safe registry. It records each space's kind (preset / vector / custom) and its privacy/indexing flags (store_text / index_text), and re-applies them on reopen, so a store_text=False space never silently flips back to retaining raw text. The registry meets the engine's durability bar: the manifest honors durability="fsync" / LODEDB_DURABILITY (spaces inherit the collection's durability), and a failed manifest publish rolls the space back (closing it, releasing its lock, restoring the registry) instead of leaving an open, unregistered, locked space.
- col.space("name") reopens preset/vector spaces from the manifest with no args; a custom-embedder space is reopened with a matching embedder=. space() enforces config before returning a cached handle; manifest writes take a lock + read-merge-write. The legacy snapshot loader skips collection.json, so opening a collection root as a plain index no longer fails with a schema-version error.
- add_image/add_images validate the payload (metadata/text) before the encode, so a bad request wastes no CLIP call and skews no metrics; reject an image whose pixel count exceeds LODEDB_MAX_IMAGE_PIXELS (default ~64 MP) before the full decode (a decompression-bomb guard); open images under a context manager; and bound peak decoded-image memory to the batch.
- stats()["image_embedding"] reports per-handle encode count, time, and failures split by phase (ingest vs query), so search_by_image encode cost is visible too (no paths or captions).
- uv.lock tracks [image]; CI runs uv lock --check, syncs --extra image, and hardens cargo crates.io fetches against the Windows schannel flake. The WAL-privacy file scan skips the writer's locked lock sentinel so it runs on Windows.

Answers to the review questions
- Should <key>.wal be classified as a payload-bearing store artifact? Yes. It is now documented as payload-bearing between checkpoints in the architecture payload-boundary section and the README: raw text under store_text=True, otherwise embedding deltas plus (with index_text=True) lexical tokens; persist()/close() checkpoint and truncate it; read-only handles never read it; commit_mode="generation" keeps no WAL. Operators should treat it as the same data class as the .tvtext/.tvlex sidecars.
- Is add_images atomic-batch-only for 0.3.x? Yes for now. It commits the whole call atomically; the docstring and docs now recommend a chunked loop (one add_images call per chunk) for large galleries, with bounded memory and natural resume points. A first-class chunked/resumable helper is a tracked follow-up, not in this PR.
- Should LodeCollection be a crash-safe source of truth? Yes, and it now is: the manifest commits crash-atomically (tmp + os.replace), honors fsync durability, and a failed publish rolls back rather than orphaning a locked space.
- LODEDB_DURABILITY? Yes. LodeCollection resolves durability exactly like LodeDB (explicit durability= arg, else LODEDB_DURABILITY, else fast) for the manifest, and spaces inherit it by default.
- Is add_image expected behind untrusted input? It is now hardened for it: payload validation runs before the encode, and oversized/decompression-bomb images are rejected from the header before decode.
- Do text= captions participate in lexical/hybrid when index_text=True? Yes. A caption's tokens are indexed on upsert (and logged in the WAL), so it is found by mode="lexical"/"hybrid"; a captionless vector/image clears the id's postings instead.
- Does LodeCollection own the privacy/indexing flags, or must callers restate them? It owns them: store_text/index_text are recorded in the manifest and re-applied on reopen, with a conflicting explicit override rejected.
- LodeCollection + custom embedders: supported. A custom space records {kind: custom, model_identity, bit_width, store_text, index_text} and is reopened with a matching embedder=; the identity is re-enforced when the underlying index opens.
- bit_width: only 2 and 4 are valid (TurboVec). It is fixed by a preset and configurable only on vector_dim= / embedder= indexes, enforced at the SDK boundary and on reopen.

Out of scope / follow-ups
- add_images helper (commit_every, progress, resume) for unbounded galleries; today it is one atomic commit per call, with a documented chunked loop.
- stats()["image_embedding"] is per-handle.
- lodedb CLI.

Testing
- uv run ruff check ., uv lock --check
- uv run pytest -q (501 passed, 2 skipped)
- durability="fsync"; add_image with invalid metadata does zero encodes.
- store_text/index_text across reopen (the privacy repro); text->vector/image replacement clears stale lexical postings (same handle + reopen); a vector caption is lexically searchable and recovers from an uncheckpointed-WAL crash; image-embedding metrics split into ingest/query.
- store_text=False text-in crash-replay (vector + lexical recovery, zero raw text); full-identity reopen rejection (vector-only/custom task collision, bit-width mismatch); preset rejects a conflicting bit_width; public embedder= identity; collection custom-space record/reopen and plain-index-open-on-collection-root; oversized-image rejection; Pillow import-boundary laziness.