2 changes: 1 addition & 1 deletion docs/api-reference/index.mdx
@@ -26,7 +26,7 @@ Other language SDKs are available through examples or third-party contributions.

| SDK Examples | Description |
|:--------------|-------------------|
| [Java API Quickstart](https://lancedb.github.io/lancedb/java/java/)| Streamline REST API interactions in Java|

{/* TODO: Add Go bindings reference page here */}

143 changes: 60 additions & 83 deletions docs/embedding/index.mdx
@@ -5,63 +5,56 @@ description: "Use the embedding API in LanceDB -- registry, functions, schemas,
icon: "bars"
---

import {
PyOpenaiEmbeddings,
PyManualQuerySearch,
PyEmbeddingFunction,
TsOpenaiEmbeddings,
TsManualQuerySearch,
TsEmbeddingFunction,
RsOpenaiEmbeddings,
RsManualQuerySearch,
RsEmbeddingFunction,
} from '/snippets/embedding.mdx';

Modern machine learning models can be trained to convert raw data into embeddings, which are vectors
of floating point numbers. The position of an embedding in vector space captures the semantics of
the data, so vectors that are close to each other are considered similar.
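
For intuition, here is a minimal, self-contained sketch of how "close in vector space" translates to similarity. The three-dimensional vectors below are made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: "cat" and "kitten" should land near each other.
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
car = [0.1, 0.9, 0.6]

print(cosine_similarity(cat, kitten))  # high: semantically similar
print(cosine_similarity(cat, car))     # lower: semantically distant
```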

LanceDB provides an embedding function registry in OSS as well as in its Cloud and Enterprise versions
([see below](#embeddings-in-lancedb-cloud-and-enterprise))
that automatically generates vector embeddings during data ingestion. Automatic query-time embedding
generation is available in LanceDB OSS, with SDK-specific query ergonomics. The API abstracts
embedding generation, allowing you to focus on your application logic.

## Embedding Registry

<Badge color="green">OSS</Badge>

In LanceDB OSS, you can get a supported embedding function from the registry and then use it in your table schema.
Once configured, the embedding function will automatically generate embeddings when you insert data
into the table. Query-time behavior depends on the SDK: Python and TypeScript can query with text directly,
while the Rust examples typically compute query embeddings explicitly before vector search.

<CodeGroup>
<CodeBlock filename="Python" language="python" icon="python">
{PyOpenaiEmbeddings}
</CodeBlock>

<CodeBlock filename="TypeScript" language="typescript" icon="square-js">
{TsOpenaiEmbeddings}
</CodeBlock>

<CodeBlock filename="Rust" language="rust" icon="rust">
{RsOpenaiEmbeddings}
</CodeBlock>
</CodeGroup>

### Using an embedding function

<Badge color="green">Python SDK</Badge>

In the Python SDK, the `.create()` method accepts several arguments to configure embedding function behavior. `max_retries` is a special argument that applies to all providers.

| Argument | Type | Description |
|---|---|---|
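
For example, a minimal sketch of capping retries when creating a function; the provider and model below are illustrative, and `max_retries` works the same way for any provider:

```python
from lancedb.embeddings import get_registry

# Cap transient-failure retries for the embedding calls made during
# ingestion and (in OSS) at query time.
func = get_registry().get("openai").create(
    name="text-embedding-ada-002",
    max_retries=3,
)
```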
@@ -108,59 +101,43 @@
Currently, the embedding registry on LanceDB <Badge color="purple">Cloud</Badge> and <Badge color="purple">Enterprise</Badge> only supports embeddings
generated on the client side (and stored on the remote table). We don't yet support automatic query-time
embedding generation when sending queries, though this is planned for a future release.

<Note>
The same manual query-embedding flow shown in the OSS examples above also applies to Cloud and Enterprise connections.
</Note>

For search, you can manually generate the embeddings at query time using the same embedding function that
was used during ingestion, and pass the embeddings to the search function.

<CodeGroup>
<CodeBlock filename="Python" language="python" icon="python">
{PyManualQuerySearch}
</CodeBlock>

<CodeBlock filename="TypeScript" language="typescript" icon="square-js">
{TsManualQuerySearch}
</CodeBlock>

<CodeBlock filename="Rust" language="rust" icon="rust">
{RsManualQuerySearch}
</CodeBlock>
</CodeGroup>

## Custom Embedding Functions

You can always implement your own embedding function:
- Python/TypeScript: subclass `TextEmbeddingFunction` (text) or `EmbeddingFunction` (multimodal).
- Rust: implement the `EmbeddingFunction` trait.

<CodeGroup>
<CodeBlock filename="Python" language="python" icon="python">
{PyEmbeddingFunction}
</CodeBlock>

<CodeBlock filename="TypeScript" language="typescript" icon="square-js">
{TsEmbeddingFunction}
</CodeBlock>

<CodeBlock filename="Rust" language="rust" icon="rust">
{RsEmbeddingFunction}
</CodeBlock>
</CodeGroup>
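
Once registered, a custom function is fetched from the registry and wired into a table schema like any built-in provider. A minimal Python sketch, assuming the `my-embedder` registration from the snippet above:

```python
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

# Assumes MyTextEmbedder was registered under "my-embedder" as shown above.
func = get_registry().get("my-embedder").create()

class MySchema(LanceModel):
    text: str = func.SourceField()                     # embedded on insert
    vector: Vector(func.ndims()) = func.VectorField()  # holds the embeddings

db = lancedb.connect("./mydb")
table = db.create_table("mytable", schema=MySchema)
table.add([{"text": "This is a test."}])  # embeddings generated automatically
```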
18 changes: 18 additions & 0 deletions docs/snippets/embedding.mdx
@@ -2,8 +2,14 @@

export const PyAsyncOpenaiEmbeddings = "db = await lancedb.connect_async(uri)\nfunc = get_registry().get(\"openai\").create(name=\"text-embedding-ada-002\")\n\nclass Words(LanceModel):\n    text: str = func.SourceField()\n    vector: Vector(func.ndims()) = func.VectorField()\n\ntable = await db.create_table(\"words\", schema=Words, mode=\"overwrite\")\nawait table.add([{\"text\": \"hello world\"}, {\"text\": \"goodbye world\"}])\n\nquery = \"greetings\"\nactual = (await (await table.search(query)).limit(1).to_pydantic(Words))[0]\nprint(actual.text)\n";

export const PyEmbeddingFunction = "from functools import cached_property\n\nfrom lancedb.embeddings import TextEmbeddingFunction, register\n\nclass MyEmbeddingModel:\n    def __init__(self, model_name: str):\n        self.model_name = model_name\n\n    def encode(self, texts: list[str]) -> list[list[float]]:\n        return [[1.0, 2.0, 3.0] for _ in texts]\n\n@register(\"my-embedder\")\nclass MyTextEmbedder(TextEmbeddingFunction):\n    model_name: str = \"my-model\"\n\n    def generate_embeddings(self, texts: list[str]) -> list[list[float]]:\n        # Your embedding logic here\n        return self._model.encode(texts)\n\n    def ndims(self) -> int:\n        # Return the dimensionality of the embeddings\n        return len(self.generate_embeddings([\"test\"])[0])\n\n    @cached_property\n    def _model(self) -> MyEmbeddingModel:\n        # Initialize your model once\n        return MyEmbeddingModel(self.model_name)\n";

export const PyImports = "from lancedb.pydantic import LanceModel, Vector\nfrom lancedb.embeddings import get_registry\n";

export const PyManualQueryEmbeddings = "db = lancedb.connect(\"/tmp/db\")\nfunc = get_registry().get(\"openai\").create(name=\"text-embedding-ada-002\")\n\nclass Words(LanceModel):\n text: str = func.SourceField()\n vector: Vector(func.ndims()) = func.VectorField()\n\ntable = db.create_table(\"words\", schema=Words, mode=\"overwrite\")\ntable.add([{\"text\": \"hello world\"}, {\"text\": \"goodbye world\"}])\n\nquery_vector = func.generate_embeddings([\"greetings\"])[0]\n# --8<-- [start:manual_query_search]\n# query_vector is assumed to already be generated by your embedding function\nactual = table.search(query_vector).limit(1).to_pydantic(Words)[0]\nprint(actual.text)\n# --8<-- [end:manual_query_search]\n";

export const PyManualQuerySearch = "# query_vector is assumed to already be generated by your embedding function\nactual = table.search(query_vector).limit(1).to_pydantic(Words)[0]\nprint(actual.text)\n";

export const PyOpenaiEmbeddings = "db = lancedb.connect(\"/tmp/db\")\nfunc = get_registry().get(\"openai\").create(name=\"text-embedding-ada-002\")\n\nclass Words(LanceModel):\n text: str = func.SourceField()\n vector: Vector(func.ndims()) = func.VectorField()\n\ntable = db.create_table(\"words\", schema=Words, mode=\"overwrite\")\ntable.add([{\"text\": \"hello world\"}, {\"text\": \"goodbye world\"}])\n\nquery = \"greetings\"\nactual = table.search(query).limit(1).to_pydantic(Words)[0]\nprint(actual.text)\n";

export const PyRegisterDevice = "import torch\n\nregistry = get_registry()\nif torch.cuda.is_available():\n registry.set_var(\"device\", \"cuda\")\n\nfunc = registry.get(\"huggingface\").create(device=\"$var:device:cpu\")\n";
@@ -14,7 +20,19 @@ export const TsEmbeddingFunction = "const db = await lancedb.connect(databaseDir

export const TsImports = "import * as lancedb from \"@lancedb/lancedb\";\nimport \"@lancedb/lancedb/embedding/openai\";\nimport { LanceSchema, getRegistry, register } from \"@lancedb/lancedb/embedding\";\nimport { EmbeddingFunction } from \"@lancedb/lancedb/embedding\";\nimport { type Float, Float32, Utf8 } from \"apache-arrow\";\n";

export const TsManualQueryEmbeddings = "const db = await lancedb.connect(databaseDir);\nconst func = getRegistry()\n .get(\"openai\")\n ?.create({ model: \"text-embedding-ada-002\" }) as EmbeddingFunction;\n\nconst wordsSchema = LanceSchema({\n text: func.sourceField(new Utf8()),\n vector: func.vectorField(),\n});\nconst tbl = await db.createEmptyTable(\"words\", wordsSchema, {\n mode: \"overwrite\",\n});\nawait tbl.add([{ text: \"hello world\" }, { text: \"goodbye world\" }]);\n\nconst queryVector = await func.computeQueryEmbeddings(\"greetings\");\n// --8<-- [start:manual_query_search]\n// queryVector is assumed to already be generated by your embedding function\nconst actual = (await tbl.search(queryVector).limit(1).toArray())[0];\n// --8<-- [end:manual_query_search]\n";

export const TsManualQuerySearch = "// queryVector is assumed to already be generated by your embedding function\nconst actual = (await tbl.search(queryVector).limit(1).toArray())[0];\n";

export const TsOpenaiEmbeddings = "const db = await lancedb.connect(databaseDir);\nconst func = getRegistry()\n .get(\"openai\")\n ?.create({ model: \"text-embedding-ada-002\" }) as EmbeddingFunction;\n\nconst wordsSchema = LanceSchema({\n text: func.sourceField(new Utf8()),\n vector: func.vectorField(),\n});\nconst tbl = await db.createEmptyTable(\"words\", wordsSchema, {\n mode: \"overwrite\",\n});\nawait tbl.add([{ text: \"hello world\" }, { text: \"goodbye world\" }]);\n\nconst query = \"greetings\";\nconst actual = (await tbl.search(query).limit(1).toArray())[0];\n";

export const TsRegisterSecret = "const registry = getRegistry();\nregistry.setVar(\"api_key\", \"sk-...\");\n\nconst func = registry.get(\"openai\")!.create({\n apiKey: \"$var:api_key\",\n});\n";

export const RsEmbeddingFunction = "use std::{borrow::Cow, sync::Arc};\n\nuse arrow_array::{Array, FixedSizeListArray, Float32Array};\nuse arrow_schema::{DataType, Field, Schema};\nuse lancedb::{\n connect,\n embeddings::{EmbeddingDefinition, EmbeddingFunction},\n Result,\n};\n\n#[derive(Debug, Clone)]\nstruct MyTextEmbedder {\n dim: usize,\n}\n\nimpl EmbeddingFunction for MyTextEmbedder {\n fn name(&self) -> &str {\n \"my-embedder\"\n }\n\n fn source_type(&self) -> Result<Cow<'_, DataType>> {\n Ok(Cow::Owned(DataType::Utf8))\n }\n\n fn dest_type(&self) -> Result<Cow<'_, DataType>> {\n Ok(Cow::Owned(DataType::new_fixed_size_list(\n DataType::Float32,\n self.dim as i32,\n true,\n )))\n }\n\n fn compute_source_embeddings(&self, source: Arc<dyn Array>) -> Result<Arc<dyn Array>> {\n let values = Arc::new(Float32Array::from(vec![1.0f32; source.len() * self.dim]));\n let field = Arc::new(Field::new(\"item\", DataType::Float32, true));\n Ok(Arc::new(FixedSizeListArray::new(\n field,\n self.dim as i32,\n values,\n None,\n )))\n }\n\n fn compute_query_embeddings(&self, _input: Arc<dyn Array>) -> Result<Arc<dyn Array>> {\n unimplemented!()\n }\n}\n\n#[tokio::main]\nasync fn main() -> Result<()> {\n let db = connect(\"./mydb\").execute().await?;\n db.embedding_registry()\n .register(\"my-embedder\", Arc::new(MyTextEmbedder { dim: 3 }))?;\n\n let schema = Arc::new(Schema::new(vec![Field::new(\"text\", DataType::Utf8, false)]));\n db.create_empty_table(\"mytable\", schema)\n .add_embedding(EmbeddingDefinition::new(\n \"text\",\n \"my-embedder\",\n Some(\"vector\"),\n ))?\n .execute()\n .await?;\n\n Ok(())\n}\n";

export const RsManualQueryEmbeddings = "use std::{iter::once, sync::Arc};\n\nuse arrow_array::{record_batch, StringArray};\nuse arrow_schema::{DataType, Field, Schema};\nuse futures::StreamExt;\nuse lancedb::{\n connect,\n embeddings::{openai::OpenAIEmbeddingFunction, EmbeddingDefinition, EmbeddingFunction},\n query::{ExecutableQuery, QueryBase},\n Result,\n};\n\n#[tokio::main]\nasync fn main() -> Result<()> {\n let db = connect(\"./mydb\").execute().await?;\n let api_key = std::env::var(\"OPENAI_API_KEY\").expect(\"OPENAI_API_KEY is not set\");\n let embedding = Arc::new(OpenAIEmbeddingFunction::new_with_model(\n api_key,\n \"text-embedding-3-large\",\n )?);\n db.embedding_registry().register(\"openai\", embedding.clone())?;\n\n let schema = Arc::new(Schema::new(vec![Field::new(\"text\", DataType::Utf8, false)]));\n let table = db\n .create_empty_table(\"mytable\", schema)\n .add_embedding(EmbeddingDefinition::new(\"text\", \"openai\", Some(\"vector\")))?\n .execute()\n .await?;\n\n table\n .add(record_batch!((\"text\", Utf8, [\"This is a test.\", \"Another example.\"]))?)\n .execute()\n .await?;\n\n // Manually generate embeddings for the query (Cloud/Enterprise path)\n let query = Arc::new(StringArray::from_iter_values(once(\"test example\")));\n let query_vector = embedding.compute_query_embeddings(query)?;\n // --8<-- [start:manual_query_search]\n // query_vector is assumed to already be generated by your embedding function\n let mut results = table.vector_search(query_vector)?.limit(5).execute().await?;\n\n while let Some(batch) = results.next().await {\n println!(\"{:?}\", batch?);\n }\n // --8<-- [end:manual_query_search]\n\n Ok(())\n}\n";

export const RsManualQuerySearch = "// query_vector is assumed to already be generated by your embedding function\nlet mut results = table.vector_search(query_vector)?.limit(5).execute().await?;\n\nwhile let Some(batch) = results.next().await {\n println!(\"{:?}\", batch?);\n}\n";

export const RsOpenaiEmbeddings = "use std::{iter::once, sync::Arc};\n\nuse arrow_array::{record_batch, StringArray};\nuse arrow_schema::{DataType, Field, Schema};\nuse futures::StreamExt;\nuse lancedb::{\n connect,\n embeddings::{openai::OpenAIEmbeddingFunction, EmbeddingDefinition, EmbeddingFunction},\n query::{ExecutableQuery, QueryBase},\n Result,\n};\n\n#[tokio::main]\nasync fn main() -> Result<()> {\n let db = connect(\"./mydb\").execute().await?;\n let api_key = std::env::var(\"OPENAI_API_KEY\").expect(\"OPENAI_API_KEY is not set\");\n let embedding = Arc::new(OpenAIEmbeddingFunction::new_with_model(\n api_key,\n \"text-embedding-3-large\",\n )?);\n\n db.embedding_registry().register(\"openai\", embedding.clone())?;\n\n let schema = Arc::new(Schema::new(vec![Field::new(\"text\", DataType::Utf8, false)]));\n let table = db\n .create_empty_table(\"mytable\", schema)\n .add_embedding(EmbeddingDefinition::new(\"text\", \"openai\", Some(\"vector\")))?\n .execute()\n .await?;\n\n table\n .add(record_batch!((\"text\", Utf8, [\"This is a test.\", \"Another example.\"]))?)\n .execute()\n .await?;\n\n let query = Arc::new(StringArray::from_iter_values(once(\"test example\")));\n let query_vector = embedding.compute_query_embeddings(query)?;\n let mut results = table.vector_search(query_vector)?.limit(5).execute().await?;\n\n while let Some(batch) = results.next().await {\n println!(\"{:?}\", batch?);\n }\n\n Ok(())\n}\n";
