From a49f41dd371b11d0d70993ff173820ecb2f04801 Mon Sep 17 00:00:00 2001
From: iamvirul
Date: Wed, 4 Mar 2026 23:38:28 +0530
Subject: [PATCH] docs: update README and CHANGELOG for BYOK embedding
 providers

- README: document cloud providers (OpenAI, Voyage, Gemini) with API key
  env vars, optional install extras, provider lock behaviour, updated tool
  signatures (index_codebase gains provider param), and updated
  get_index_status example showing provider/model/dims fields
- CHANGELOG: add [Unreleased] entry covering BYOK cloud providers, strategy-
  pattern EmbeddingProvider ABC, dynamic vector dims, provider lock,
  live-sync guard, VectorStore meta helpers, and new test suite
---
 CHANGELOG.md | 58 ++++++++++++++++++++++++++++++++++++++++++++
 README.md    | 68 ++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 119 insertions(+), 7 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 183a861..beff98d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,64 @@ All notable changes to VecGrep are documented here.
 
 ---
 
+## [Unreleased] — feat/byok-embedding-providers
+
+### Added
+
+- **BYOK cloud embedding providers** — bring your own API key to use OpenAI,
+  Voyage AI, or Google Gemini embeddings instead of the default local model.
+  Pass `provider="openai"` (or `"voyage"` / `"gemini"`) to `index_codebase`.
+
+  | Provider | Model | Dims | API key env var | Install extra |
+  |---|---|---|---|---|
+  | `openai` | `text-embedding-3-small` | 1536 | `VECGREP_OPENAI_KEY` | `vecgrep[openai]` |
+  | `voyage` | `voyage-code-3` | 1024 | `VECGREP_VOYAGE_KEY` | `vecgrep[voyage]` |
+  | `gemini` | `gemini-embedding-exp-03-07` | 3072 | `VECGREP_GEMINI_KEY` | `vecgrep[gemini]` |
+
+- **Strategy-pattern `EmbeddingProvider` ABC** — `LocalProvider`,
+  `OpenAIProvider`, `VoyageProvider`, and `GeminiProvider` all implement the
+  same `embed(texts) → np.ndarray` interface. Adding new providers only
+  requires subclassing `EmbeddingProvider`.
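The strategy-pattern interface described in the changelog entry above can be sketched in a few lines. This is an illustrative stand-in, not VecGrep's actual source: the class and method names follow the entry, but the `LocalProvider` body uses random vectors as a placeholder for the real model call.

```python
from abc import ABC, abstractmethod

import numpy as np


class EmbeddingProvider(ABC):
    """Strategy interface: map a batch of texts to L2-normalized vectors."""

    dims: int  # embedding dimensionality, e.g. 384 local, 1536 OpenAI

    @abstractmethod
    def embed(self, texts: list[str]) -> np.ndarray:
        """Return an array of shape (len(texts), self.dims)."""


class LocalProvider(EmbeddingProvider):
    dims = 384

    def embed(self, texts: list[str]) -> np.ndarray:
        # Placeholder body: a real provider would run the ONNX/torch model.
        vecs = np.random.default_rng(0).normal(size=(len(texts), self.dims))
        return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
```

A new provider then only needs to subclass `EmbeddingProvider` and fill in `dims` and `embed`.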
+
+- **Dynamic vector dimensions** — `VectorStore` now stores the embedding
+  dimensionality in the meta table and creates the LanceDB schema with the
+  correct dims for the chosen provider (384 / 1024 / 1536 / 3072). Backward
+  compatible: existing 384-dim indexes open without migration.
+
+- **Provider lock** — once a project index is built with a provider, re-
+  indexing with a different provider requires `force=True`. This prevents
+  silent dimension mismatches. The lock is stored in the per-project meta
+  table; switching with `force=True` drops and recreates the chunks table.
+
+- **`get_index_status` now reports provider metadata** — the `Provider`,
+  `Model`, and `Dimensions` fields are printed in the status output.
+
+- **Optional dependency extras in `pyproject.toml`** —
+  `vecgrep[openai]`, `vecgrep[voyage]`, `vecgrep[gemini]`, `vecgrep[cloud]`
+  install only the packages needed for the chosen provider.
+
+### Changed
+
+- **Live-sync guard for cloud providers** — `watch=True` is rejected for any
+  non-local provider; live file-change sync with cloud embeddings would
+  incur unbounded API costs.
+
+- **`_get_meta` / `_set_meta` helpers on `VectorStore`** — refactored
+  manual meta-table queries into reusable key/value helpers used throughout
+  the store.
+
+### Tests
+
+- Added a full BYOK test suite (`tests/test_providers.py`) covering:
+  - Provider registry (`get_provider`, unknown provider errors)
+  - `LocalProvider` shape, dtype, and L2-normalization
+  - Backward-compatible `embed()` free function
+  - Cloud providers raising `RuntimeError` when API key or package is missing
+  - `OpenAIProvider`, `VoyageProvider`, `GeminiProvider` with mocked API
+    responses (shape, dtype, normalization, empty-input edge cases)
+
+---
+
 ## [1.6.0] — 2026-03-02
 
 ### Added

diff --git a/README.md b/README.md
index dd51359..24b0ce9 100644
--- a/README.md
+++ b/README.md
@@ -11,8 +11,10 @@ Instead of grepping 50 files and sending 30,000 tokens to Claude, VecGrep return
 ## How it works
 
 1. **Chunk** — Parses source files with tree-sitter to extract semantic units (functions, classes, methods)
-2. **Embed** — Encodes each chunk locally using [`all-MiniLM-L6-v2-code-search-512`](https://huggingface.co/isuruwijesiri/all-MiniLM-L6-v2-code-search-512) (384-dim, ~80MB one-time download) via the fastembed ONNX backend (~100ms startup) or PyTorch, automatically using Metal (Apple Silicon), CUDA (NVIDIA), or CPU
-3. **Store** — Saves embeddings + metadata in LanceDB under `~/.vecgrep//`
+2. **Embed** — Encodes each chunk using the configured embedding provider:
+   - **Local** (default) — [`all-MiniLM-L6-v2-code-search-512`](https://huggingface.co/isuruwijesiri/all-MiniLM-L6-v2-code-search-512) via fastembed ONNX (~100ms startup, no API key) or PyTorch, with auto device detection (Apple Silicon, CUDA, CPU)
+   - **Cloud (BYOK)** — OpenAI, Voyage AI, or Google Gemini via your own API key (higher-quality embeddings, optional)
+3. **Store** — Saves embeddings + metadata in LanceDB under `~/.vecgrep//`; vector dimensions adapt automatically to the chosen provider
 4. **Search** — ANN index (IVF-PQ) for fast approximate search on large codebases
 
 Incremental re-indexing via mtime/size checks skips unchanged files.
@@ -66,15 +68,22 @@ After the first index, subsequent searches skip unchanged files automatically
 
 ## Tools
 
-### `index_codebase(path, force=False)`
+### `index_codebase(path, force=False, watch=False, provider=None)`
 
 Index a project directory. Skips unchanged files on subsequent calls.
 
 ```
 index_codebase("/path/to/myproject")
 # → "Indexed 142 file(s), 1847 chunk(s) added (0 file(s) skipped, unchanged)"
+
+# Use OpenAI embeddings instead of local
+index_codebase("/path/to/myproject", provider="openai")
 ```
+
+**Provider lock**: once a project is indexed with a provider, re-indexing with a different provider requires `force=True` (this rebuilds the vector table with the new embedding dimensions).
+
+**Note:** `watch=True` is only supported with the `local` provider — live sync with cloud providers would incur unbounded API costs.
+
 ### `search_code(query, path, top_k=8)`
 
 Semantic search. Auto-indexes if no index exists.
@@ -96,7 +105,7 @@ def authenticate_user(token: str) -> User:
 
 ### `get_index_status(path)`
 
-Check index statistics.
+Check index statistics, including the embedding provider used.
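The `watch=True` restriction above amounts to a one-line guard. A hedged sketch, where `validate_watch` is a hypothetical name rather than VecGrep's actual function:

```python
def validate_watch(provider: str, watch: bool) -> None:
    """Reject live sync for any non-local provider."""
    if watch and provider != "local":
        raise ValueError(
            "watch=True is only supported with the local provider; "
            f"live sync with '{provider}' would incur unbounded API costs"
        )
```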
 
 ```
 Index status for: /path/to/myproject
@@ -104,16 +113,21 @@ Index status for: /path/to/myproject
   Total chunks: 1847
   Last indexed: 2026-02-22T07:20:31+00:00
   Index size: 28.4 MB
+  Provider: local
+  Model: isuruwijesiri/all-MiniLM-L6-v2-code-search-512
+  Dimensions: 384
 ```
 
 ## Configuration
 
 VecGrep can be tuned via environment variables:
 
+### Local provider
+
 | Variable | Default | Description |
 |---|---|---|
-| `VECGREP_BACKEND` | `onnx` | Embedding backend: `onnx` (fastembed, fast startup) or `torch` (sentence-transformers, any HF model) |
-| `VECGREP_MODEL` | `isuruwijesiri/all-MiniLM-L6-v2-code-search-512` | HuggingFace model ID to use for embeddings |
+| `VECGREP_BACKEND` | `onnx` | Local backend: `onnx` (fastembed, fast startup) or `torch` (sentence-transformers, any HF model) |
+| `VECGREP_MODEL` | `isuruwijesiri/all-MiniLM-L6-v2-code-search-512` | HuggingFace model ID (local provider only) |
 
 **Backend comparison:**
 
@@ -122,7 +136,47 @@ VecGrep can be tuned via environment variables:
 | `onnx` (default) | ~100ms | No | ONNX-exported models only |
 | `torch` | ~2–3s | Yes | Any HuggingFace model |
 
-**Examples:**
+### Cloud providers (BYOK — Bring Your Own Key)
+
+VecGrep supports three cloud embedding providers. Each requires an API key environment variable and the corresponding optional dependency.
+
+| Provider | Env var | Model | Dims | Install extra |
+|---|---|---|---|---|
+| `openai` | `VECGREP_OPENAI_KEY` | `text-embedding-3-small` | 1536 | `vecgrep[openai]` |
+| `voyage` | `VECGREP_VOYAGE_KEY` | `voyage-code-3` | 1024 | `vecgrep[voyage]` |
+| `gemini` | `VECGREP_GEMINI_KEY` | `gemini-embedding-exp-03-07` | 3072 | `vecgrep[gemini]` |
+
+**Install cloud extras:**
+
+```bash
+# Single provider
+uv tool install --python 3.12 'vecgrep[openai]'
+pip install 'vecgrep[openai]'
+
+# All cloud providers at once
+pip install 'vecgrep[cloud]'
+```
+
+**Use a cloud provider:**
+
+```bash
+# Set your API key
+export VECGREP_OPENAI_KEY=sk-...
+
+# Index with OpenAI embeddings
+index_codebase("/path/to/myproject", provider="openai")
+
+# Or tell Claude to use it:
+# "Index my project at /path/to/myproject using openai embeddings"
+```
+
+**Switch providers** (requires force re-index to rebuild the vector table):
+
+```
+index_codebase("/path/to/myproject", provider="voyage", force=True)
+```
+
+**Local backend examples:**
 
 ```bash
 # Use a different model with the torch backend