From a49f41dd371b11d0d70993ff173820ecb2f04801 Mon Sep 17 00:00:00 2001
From: iamvirul
Date: Wed, 4 Mar 2026 23:38:28 +0530
Subject: [PATCH] docs: update README and CHANGELOG for BYOK embedding
 providers

- README: document cloud providers (OpenAI, Voyage, Gemini) with API key
  env vars, optional install extras, provider lock behaviour, updated tool
  signatures (index_codebase gains provider param), and updated
  get_index_status example showing provider/model/dims fields
- CHANGELOG: add [Unreleased] entry covering BYOK cloud providers, strategy-
  pattern EmbeddingProvider ABC, dynamic vector dims, provider lock,
  live-sync guard, VectorStore meta helpers, and new test suite
---
 CHANGELOG.md | 58 ++++++++++++++++++++++++++++++++++++++++++++
 README.md    | 68 ++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 119 insertions(+), 7 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 183a861..beff98d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,64 @@ All notable changes to VecGrep are documented here.
 
 ---
 
+## [Unreleased] — feat/byok-embedding-providers
+
+### Added
+
+- **BYOK cloud embedding providers** — bring your own API key to use OpenAI,
+  Voyage AI, or Google Gemini embeddings instead of the default local model.
+  Pass `provider="openai"` (or `"voyage"` / `"gemini"`) to `index_codebase`.
+
+  | Provider | Model | Dims | API key env var | Install extra |
+  |---|---|---|---|---|
+  | `openai` | `text-embedding-3-small` | 1536 | `VECGREP_OPENAI_KEY` | `vecgrep[openai]` |
+  | `voyage` | `voyage-code-3` | 1024 | `VECGREP_VOYAGE_KEY` | `vecgrep[voyage]` |
+  | `gemini` | `gemini-embedding-exp-03-07` | 3072 | `VECGREP_GEMINI_KEY` | `vecgrep[gemini]` |
+
+- **Strategy-pattern `EmbeddingProvider` ABC** — `LocalProvider`,
+  `OpenAIProvider`, `VoyageProvider`, and `GeminiProvider` all implement the
+  same `embed(texts) → np.ndarray` interface. Adding new providers only
+  requires subclassing `EmbeddingProvider`.
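The strategy-pattern interface described in the changelog entry above can be sketched in a few lines. This is an illustrative stand-in, not VecGrep's actual source: the class and method names follow the entry, but the `LocalProvider` body uses random vectors as a placeholder for the real model call.

```python
from abc import ABC, abstractmethod

import numpy as np


class EmbeddingProvider(ABC):
    """Strategy interface: map a batch of texts to L2-normalized vectors."""

    dims: int  # embedding dimensionality, e.g. 384 local, 1536 OpenAI

    @abstractmethod
    def embed(self, texts: list[str]) -> np.ndarray:
        """Return an array of shape (len(texts), self.dims)."""


class LocalProvider(EmbeddingProvider):
    dims = 384

    def embed(self, texts: list[str]) -> np.ndarray:
        # Placeholder body: a real provider would run the ONNX/torch model.
        vecs = np.random.default_rng(0).normal(size=(len(texts), self.dims))
        return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
```

A new provider then only needs to subclass `EmbeddingProvider` and fill in `dims` and `embed`.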
+
+- **Dynamic vector dimensions** — `VectorStore` now stores the embedding
+  dimensionality in the meta table and creates the LanceDB schema with the
+  correct dims for the chosen provider (384 / 1024 / 1536 / 3072). Backward
+  compatible: existing 384-dim indexes open without migration.
+
+- **Provider lock** — once a project index is built with a provider, re-
+  indexing with a different provider requires `force=True`. This prevents
+  silent dimension mismatches. The lock is stored in the per-project meta
+  table; switching with `force=True` drops and recreates the chunks table.
+
+- **`get_index_status` now reports provider metadata** — the `Provider`,
+  `Model`, and `Dimensions` fields are printed in the status output.
+
+- **Optional dependency extras in `pyproject.toml`** —
+  `vecgrep[openai]`, `vecgrep[voyage]`, `vecgrep[gemini]`, `vecgrep[cloud]`
+  install only the packages needed for the chosen provider.
+
+### Changed
+
+- **Live-sync guard for cloud providers** — `watch=True` is rejected for any
+  non-local provider; live file-change sync with cloud embeddings would
+  incur unbounded API costs.
+
+- **`_get_meta` / `_set_meta` helpers on `VectorStore`** — refactored
+  manual meta-table queries into reusable key/value helpers used throughout
+  the store.
+
+### Tests
+
+- Added a full BYOK test suite (`tests/test_providers.py`) covering:
+  - Provider registry (`get_provider`, unknown provider errors)
+  - `LocalProvider` shape, dtype, and L2-normalization
+  - Backward-compatible `embed()` free function
+  - Cloud providers raising `RuntimeError` when API key or package is missing
+  - `OpenAIProvider`, `VoyageProvider`, `GeminiProvider` with mocked API
+    responses (shape, dtype, normalization, empty-input edge cases)
+
+---
+
 ## [1.6.0] — 2026-03-02
 
 ### Added

diff --git a/README.md b/README.md
index dd51359..24b0ce9 100644
--- a/README.md
+++ b/README.md
@@ -11,8 +11,10 @@ Instead of grepping 50 files and sending 30,000 tokens to Claude, VecGrep return
 ## How it works
 
 1. **Chunk** — Parses source files with tree-sitter to extract semantic units (functions, classes, methods)
-2. **Embed** — Encodes each chunk locally using [`all-MiniLM-L6-v2-code-search-512`](https://huggingface.co/isuruwijesiri/all-MiniLM-L6-v2-code-search-512) (384-dim, ~80MB one-time download) via the fastembed ONNX backend (~100ms startup) or PyTorch, automatically using Metal (Apple Silicon), CUDA (NVIDIA), or CPU
-3. **Store** — Saves embeddings + metadata in LanceDB under `~/.vecgrep//`
+2. **Embed** — Encodes each chunk using the configured embedding provider:
+   - **Local** (default) — [`all-MiniLM-L6-v2-code-search-512`](https://huggingface.co/isuruwijesiri/all-MiniLM-L6-v2-code-search-512) via fastembed ONNX (~100ms startup, no API key) or PyTorch, with auto device detection (Apple Silicon, CUDA, CPU)
+   - **Cloud (BYOK)** — OpenAI, Voyage AI, or Google Gemini via your own API key (higher-quality embeddings, optional)
+3. **Store** — Saves embeddings + metadata in LanceDB under `~/.vecgrep//`; vector dimensions adapt automatically to the chosen provider
 4. **Search** — ANN index (IVF-PQ) for fast approximate search on large codebases
 
 Incremental re-indexing via mtime/size checks skips unchanged files.
@@ -66,15 +68,22 @@ After the first index, subsequent searches skip unchanged files automatically
 
 ## Tools
 
-### `index_codebase(path, force=False)`
+### `index_codebase(path, force=False, watch=False, provider=None)`
 
 Index a project directory. Skips unchanged files on subsequent calls.
 
 ```
 index_codebase("/path/to/myproject")
 # → "Indexed 142 file(s), 1847 chunk(s) added (0 file(s) skipped, unchanged)"
+
+# Use OpenAI embeddings instead of local
+index_codebase("/path/to/myproject", provider="openai")
 ```
+
+**Provider lock**: once a project is indexed with a provider, re-indexing with a different provider requires `force=True` (this rebuilds the vector table with the new embedding dimensions).
+
+**Note:** `watch=True` is only supported with the `local` provider — live sync with cloud providers would incur unbounded API costs.
+
 ### `search_code(query, path, top_k=8)`
 
 Semantic search. Auto-indexes if no index exists.
@@ -96,7 +105,7 @@ def authenticate_user(token: str) -> User:
 
 ### `get_index_status(path)`
 
-Check index statistics.
+Check index statistics, including the embedding provider used.
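The `watch=True` restriction above amounts to a one-line guard. A hedged sketch, where `validate_watch` is a hypothetical name rather than VecGrep's actual function:

```python
def validate_watch(provider: str, watch: bool) -> None:
    """Reject live sync for any non-local provider."""
    if watch and provider != "local":
        raise ValueError(
            "watch=True is only supported with the local provider; "
            f"live sync with '{provider}' would incur unbounded API costs"
        )
```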
 
 ```
 Index status for: /path/to/myproject
@@ -104,16 +113,21 @@ Index status for: /path/to/myproject
   Total chunks: 1847
   Last indexed: 2026-02-22T07:20:31+00:00
   Index size: 28.4 MB
+  Provider: local
+  Model: isuruwijesiri/all-MiniLM-L6-v2-code-search-512
+  Dimensions: 384
 ```
 
 ## Configuration
 
 VecGrep can be tuned via environment variables:
 
+### Local provider
+
 | Variable | Default | Description |
 |---|---|---|
-| `VECGREP_BACKEND` | `onnx` | Embedding backend: `onnx` (fastembed, fast startup) or `torch` (sentence-transformers, any HF model) |
-| `VECGREP_MODEL` | `isuruwijesiri/all-MiniLM-L6-v2-code-search-512` | HuggingFace model ID to use for embeddings |
+| `VECGREP_BACKEND` | `onnx` | Local backend: `onnx` (fastembed, fast startup) or `torch` (sentence-transformers, any HF model) |
+| `VECGREP_MODEL` | `isuruwijesiri/all-MiniLM-L6-v2-code-search-512` | HuggingFace model ID (local provider only) |
 
 **Backend comparison:**
 
@@ -122,7 +136,47 @@ VecGrep can be tuned via environment variables:
 | `onnx` (default) | ~100ms | No | ONNX-exported models only |
 | `torch` | ~2–3s | Yes | Any HuggingFace model |
 
-**Examples:**
+### Cloud providers (BYOK — Bring Your Own Key)
+
+VecGrep supports three cloud embedding providers. Each requires an API key environment variable and the corresponding optional dependency.
+
+| Provider | Env var | Model | Dims | Install extra |
+|---|---|---|---|---|
+| `openai` | `VECGREP_OPENAI_KEY` | `text-embedding-3-small` | 1536 | `vecgrep[openai]` |
+| `voyage` | `VECGREP_VOYAGE_KEY` | `voyage-code-3` | 1024 | `vecgrep[voyage]` |
+| `gemini` | `VECGREP_GEMINI_KEY` | `gemini-embedding-exp-03-07` | 3072 | `vecgrep[gemini]` |
+
+**Install cloud extras:**
+
+```bash
+# Single provider
+uv tool install --python 3.12 'vecgrep[openai]'
+pip install 'vecgrep[openai]'
+
+# All cloud providers at once
+pip install 'vecgrep[cloud]'
+```
+
+**Use a cloud provider:**
+
+```bash
+# Set your API key
+export VECGREP_OPENAI_KEY=sk-...
+
+# Index with OpenAI embeddings
+index_codebase("/path/to/myproject", provider="openai")
+
+# Or tell Claude to use it:
+# "Index my project at /path/to/myproject using openai embeddings"
+```
+
+**Switch providers** (requires force re-index to rebuild the vector table):
+
+```
+index_codebase("/path/to/myproject", provider="voyage", force=True)
+```
+
+**Local backend examples:**
 
 ```bash
 # Use a different model with the torch backend