From 892c7438552db0e8f54f6c58baae4a16005aec7f Mon Sep 17 00:00:00 2001 From: prrao87 Date: Wed, 18 Feb 2026 15:52:40 -0500 Subject: [PATCH 1/2] Update supported index type --- docs/search/vector-search.mdx | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/docs/search/vector-search.mdx b/docs/search/vector-search.mdx index f7118d9..e330948 100644 --- a/docs/search/vector-search.mdx +++ b/docs/search/vector-search.mdx @@ -21,19 +21,30 @@ Ensure you always use the same distance metric that your embedding model was tra The right metric improves both search accuracy and query performance. Currently, LanceDB supports the following metrics: -| Metric | Description | Default | -| :-------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------ | -| `l2` | [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) - measures the straight-line distance between two points in vector space. Calculated as the square root of the sum of squared differences between corresponding vector components. | ✓ | -| `cosine` | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) - measures the cosine of the angle between two vectors, ranging from -1 to 1. Computed as the dot product divided by the product of vector magnitudes. Use for unnormalized vectors. | x | -| `dot` | [Dot product](https://en.wikipedia.org/wiki/Dot_product) - calculates the sum of products of corresponding vector components. Provides raw similarity scores without normalization, sensitive to vector magnitudes. Use for normalized vectors for best performance. | x | -| `hamming` | [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) - counts the number of positions where corresponding bits differ between binary vectors. 
Only applicable to binary vectors stored as packed uint8 arrays. | x | +| Distance metric | Mathematical form | Notes | +|---|---|---| +| `l2` | $\|x-y\|_2=\sqrt{\sum_i (x_i-y_i)^2}$ | Measures the straight-line distance between two points in vector space. Calculated as the square root of the sum of squared differences between corresponding vector components. | +| `cosine` | $1-\frac{x\cdot y}{\|x\|_2\|y\|_2}$ | Measures directional difference between vectors. Computed as 1 minus cosine similarity (the dot product normalized by both vector magnitudes), so vector length does not affect the score. Use for unnormalized vectors. | +| `dot` | $x\cdot y=\sum_i x_i y_i$ | Calculates the sum of products of corresponding vector components. Provides raw similarity scores without normalization, sensitive to vector magnitudes. Use for normalized vectors for best performance. | +| `hamming` | $\sum_i \mathbf{1}[x_i\neq y_i]$ | Counts the number of positions where corresponding bits differ between binary vectors. Only applicable to binary vectors stored as packed uint8 arrays. | + +For indexed search, supported distance metrics vary by index type: + +| Index type | Supported distance metrics | +|---|---| +| `IVF_FLAT` | `["l2", "cosine", "dot", "hamming"]` | +| `IVF_PQ` | `["l2", "cosine", "dot"]` | +| `IVF_SQ` | `["l2", "cosine", "dot"]` | +| `IVF_RQ` | `["l2", "cosine", "dot"]` | +| `IVF_HNSW_PQ` | `["l2", "cosine", "dot"]` | +| `IVF_HNSW_SQ` | `["l2", "cosine", "dot"]` | ### Configure Distance Metric By default, `l2` will be used as metric type. You can specify the metric type as `cosine` or `dot` if required. -**Note:** You can configure the distance metric during search only if there’s no vector index. If a vector index exists, the distance metric will always be the one you specified when creating the index. +**Note:** You can configure the distance metric during search only if there's no vector index. 
If a vector index exists, the distance metric will always be the one you specified when creating the index. ```python Python icon="python" From 45aa6a0699b02791d030b234948d95b3c4ea4fc7 Mon Sep 17 00:00:00 2001 From: prrao87 Date: Wed, 18 Feb 2026 17:36:44 -0500 Subject: [PATCH 2/2] More clarity to HNSW index and vector search --- docs/indexing/index.mdx | 3 +-- docs/indexing/vector-index.mdx | 36 ++++++++++++++++++++-------------- docs/search/vector-search.mdx | 2 +- 3 files changed, 23 insertions(+), 18 deletions(-) diff --git a/docs/indexing/index.mdx b/docs/indexing/index.mdx index d8889eb..39c33b9 100644 --- a/docs/indexing/index.mdx +++ b/docs/indexing/index.mdx @@ -28,13 +28,12 @@ LanceDB provides a comprehensive suite of indexing strategies for different data | Index | Use Case | Description | | :--------- | :------- | :---------- | -| `HNSW` (Vector) | High recall and low latency vector searches. Ideal for applications requiring fast approximate nearest neighbor queries with high accuracy. | Hierarchical Navigable Small World—a graph-based approximate nearest neighbor algorithm.
Distance metrics: `l2` `cosine` `dot`
Quantizations: `PQ` `SQ`| | `IVF` (Vector) | Large-scale vector search with configurable accuracy/speed trade-offs. Supports binary vectors with hamming distance. | Inverted File Index—a partition-based approximate nearest neighbor algorithm that groups similar vectors into partitions for efficient search.
Distance metrics: `l2` `cosine` `dot` `hamming`
Quantizations: `None/Flat` `PQ` `SQ` `RQ`| | `IVF_HNSW` (Vector) | Large-scale vector search requiring both high recall and efficient partitioning. Combines the scalability of IVF with the search quality of HNSW. | Hybrid index combining IVF partitioning with HNSW graphs built within each partition. Provides improved search quality over pure IVF while maintaining scalability.
Distance metrics: `l2` `cosine` `dot`
Quantizations: `SQ`, `PQ`| +| `FTS` (Full-text search) | String columns (e.g., title, description, content) requiring keyword-based search with BM25 ranking. | Full-text search index using BM25 ranking algorithm. Tokenizes text with configurable tokenization, stemming, stop word removal, and language-specific processing. | | `BTree` (Scalar) | Numeric, temporal, and string columns with mostly distinct values. Best for highly selective queries on columns with many unique values. | Sorted index storing sorted copies of scalar columns with block headers in a btree cache. Header entries map to blocks of rows (4096 rows per block) for efficient disk reads. | | `Bitmap` (Scalar) | Low-cardinality columns with few thousand or fewer distinct values. Accelerates equality and range filters. | Stores a bitmap for each distinct value in the column, with one bit per row indicating presence. Memory-efficient for low-cardinality data. | | `LabelList` (Scalar) | List columns (e.g., tags, categories, keywords) requiring array containment queries. | Scalar index for `List` columns using an underlying bitmap index structure to enable fast array membership lookups. | -| `FTS` (Full-text) | String columns (e.g., title, description, content) requiring keyword-based search with BM25 ranking. | Full-text search index using BM25 ranking algorithm. Tokenizes text with configurable tokenization, stemming, stop word removal, and language-specific processing. | TypeScript currently doesn't support `IvfSq` (IVF with Scalar Quantization). diff --git a/docs/indexing/vector-index.mdx b/docs/indexing/vector-index.mdx index 379624d..2dae75c 100644 --- a/docs/indexing/vector-index.mdx +++ b/docs/indexing/vector-index.mdx @@ -1,7 +1,7 @@ --- title: "Vector Indexes" sidebarTitle: "Vector Index" -description: "Build and optimize LanceDB vector indexes, including IVF_HNSW_SQ, IVF_RQ, IVF_PQ, and binary indexes." 
+description: "Build and optimize LanceDB vector indexes, including IVF, HNSW and binary quantized indexes." icon: "arrow-up-right-dots" --- import { @@ -18,33 +18,39 @@ import { PyVectorIndexCheckStatus as VectorIndexCheckStatus, } from '/snippets/indexing.mdx'; -LanceDB offers two main vector indexing algorithms: **Inverted File (IVF)** and **Hierarchically Navigable Small Worlds (HNSW)**. You can create multiple vector indexes within a Lance table. This guide walks through common configurations and build patterns. +You can create and manage multiple vector indexes on any Lance dataset. LanceDB offers two kinds of vector indexing algorithms: **Inverted File (IVF)** and **Hierarchically Navigable Small Worlds (HNSW)**. -### Option 1: Self-Hosted Indexing + +**IVF + HNSW** -**Manual, Sync or Async:** If using LanceDB Open Source, you will have to build indexes manually, as well as reindex and tune indexing parameters. The Python SDK lets you do this *synchronously and asynchronously*. +In LanceDB, HNSW is not exposed as a top-level vector index. Instead, it's available as a sub-index inside IVF partitions. What this means in practice is that vectors are first partitioned by IVF, then each selected partition is searched using an HNSW graph (with quantization via `IVF_HNSW_PQ` / `IVF_HNSW_SQ`). This combines IVF's scalability with HNSW's higher-recall ANN search within partitions. + -### Option 2: Automated Indexing +### Manual Indexing -**Automatic and Async:** Indexing is automatic in LanceDB Cloud/Enterprise. As soon as data is updated, our system automates index optimization. *This is done asynchronously*. +If using LanceDB OSS, you have to create the vector index manually by calling `table.create_index()`. Updating the index as new data arrives and tuning its parameters are also manual steps.
-Here is what happens in the background - when a table contains a single vector column named `vector`, LanceDB automatically: +### Automatic Indexing -- Infers the vector column from the schema -- Creates an optimized `IVF_PQ` index without manual configuration -- The default distance is `l2` or euclidean + Enterprise-only +Vector indexing is managed **automatically** in LanceDB Cloud/Enterprise. As soon as data is updated, the system updates the index and optimizes it. *This is done asynchronously as a background process*. -Finally, LanceDB Cloud/Enterprise will analyze your data distribution to **automatically configure indexing parameters**. +When you create a table in LanceDB Enterprise, LanceDB automatically: - -You can create a new index with different parameters using `create_index` - this replaces any existing index +- Infers the vector columns from the schema +- Creates an optimized `IVF_PQ` index without manual configuration +- Automatically configures indexing parameters +The default distance is `l2` (Euclidean). + + +You can call `create_index()` with different parameters to create a new index -- this replaces any existing index. Although the `create_index` API returns immediately, the building of the vector index is asynchronous. To wait until all data is fully indexed, you can specify the `wait_timeout` parameter. ## Choose the Right Index -Use this table as a quick starting point: +Use this table as a quick starting point for choosing the right index type and quantization method for your use case: | If your top priority is... | Use this index | Why | Typical compressed size vs. raw vectors | | :--- | :--- | :--- | :--- | @@ -59,7 +65,7 @@ If your vector search frequently includes metadata filters (`where(...)`), prefe Compression ratios are practical rules of thumb and can vary with vector distribution, metric, and configuration. For small dimensions, choose `IVF_PQ` for accuracy, not for guaranteed higher compression than `IVF_RQ`.
-### Indexing Tuning by Index Type +### Index Tuning Start with these values, then tune for your workload: diff --git a/docs/search/vector-search.mdx b/docs/search/vector-search.mdx index e330948..82008d3 100644 --- a/docs/search/vector-search.mdx +++ b/docs/search/vector-search.mdx @@ -42,7 +42,7 @@ For indexed search, supported distance metrics vary by index type: ### Configure Distance Metric By default, `l2` will be used as metric type. You can specify the metric type as -`cosine` or `dot` if required. +`cosine` or `dot` if required (`hamming` is supported for `IVF_FLAT` index only). **Note:** You can configure the distance metric during search only if there's no vector index. If a vector index exists, the distance metric will always be the one you specified when creating the index.
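
For reviewers who want to sanity-check the formulas in the new distance metric table, the following is a minimal pure-Python sketch of the four mathematical forms, plus the index-to-metric support table from this patch expressed as a lookup. The function names, `SUPPORTED_METRICS`, and `is_supported` are hypothetical helpers written for illustration; they are not part of the LanceDB API, and the values LanceDB reports at query time may be computed or scaled differently.

```python
import math

# Index type -> supported metrics, copied from the table added in this patch.
# This dict is an illustrative assumption, not a LanceDB constant.
SUPPORTED_METRICS = {
    "IVF_FLAT": ["l2", "cosine", "dot", "hamming"],
    "IVF_PQ": ["l2", "cosine", "dot"],
    "IVF_SQ": ["l2", "cosine", "dot"],
    "IVF_RQ": ["l2", "cosine", "dot"],
    "IVF_HNSW_PQ": ["l2", "cosine", "dot"],
    "IVF_HNSW_SQ": ["l2", "cosine", "dot"],
}


def dot(x, y):
    """Dot product: sum_i x_i * y_i."""
    return sum(a * b for a, b in zip(x, y))


def l2(x, y):
    """Euclidean distance: sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    """Cosine distance: 1 - (x . y) / (||x||_2 * ||y||_2)."""
    norm_x = math.sqrt(dot(x, x))
    norm_y = math.sqrt(dot(y, y))
    return 1.0 - dot(x, y) / (norm_x * norm_y)


def hamming(x, y):
    """Number of differing bits between two packed uint8 vectors (bytes)."""
    return sum(bin(a ^ b).count("1") for a, b in zip(x, y))


def is_supported(index_type, metric):
    """Check a metric against the support table above (hypothetical helper)."""
    return metric in SUPPORTED_METRICS.get(index_type, [])


if __name__ == "__main__":
    print(l2([0.0, 0.0], [3.0, 4.0]))
    print(cosine([1.0, 0.0], [2.0, 0.0]))
    print(hamming(bytes([0b1010]), bytes([0b0110])))
    print(is_supported("IVF_PQ", "hamming"))
```

Note that `cosine` ignores vector length (scaling either input leaves the result unchanged), while `dot` does not, which is why the table recommends `dot` only for normalized vectors.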