Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
176 changes: 153 additions & 23 deletions integrations/vllm.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,57 +8,187 @@ authors:
github: deepset-ai
twitter: deepset_ai
linkedin: https://www.linkedin.com/company/deepset-ai/
pypi: https://pypi.org/project/haystack-ai
repo: https://github.com/deepset-ai/haystack
pypi: https://pypi.org/project/vllm-haystack
repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/vllm
type: Model Provider
report_issue: https://github.com/deepset-ai/haystack/issues
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/vllm.png
version: Haystack 2.0
toc: true
---
Simply use [vLLM](https://github.com/vllm-project/vllm) in your haystack pipeline, to utilize fast, self-hosted LLMs.

### Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Components](#components)
- [Usage](#usage)
- [Serving a model with vLLM](#serving-a-model-with-vllm)
- [VLLMChatGenerator](#vllmchatgenerator)
- [VLLMTextEmbedder and VLLMDocumentEmbedder](#vllmtextembedder-and-vllmdocumentembedder)
- [VLLMRanker](#vllmranker)
- [End-to-end example](#end-to-end-example)
- [License](#license)

## Overview

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
It is an open-source project that allows serving open models in production, when you have GPU resources available.

vLLM can be deployed as a server that implements the OpenAI API protocol and integration with Haystack comes out-of-the-box.
This allows vLLM to be used with the [`OpenAIGenerator`](https://docs.haystack.deepset.ai/docs/openaigenerator) and [`OpenAIChatGenerator`](https://docs.haystack.deepset.ai/docs/openaichatgenerator) components in Haystack.
vLLM serves models behind an OpenAI-compatible HTTP server and supports generative, embedding, and ranking models. The `vllm-haystack` integration provides dedicated Haystack components that connect to a running vLLM server.

For an end-to-end example of [vLLM + Haystack, see this notebook](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/vllm_inference_engine.ipynb).
## Installation

Install vLLM following the [official instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html). For production use cases, there are other options, including [Docker](https://docs.vllm.ai/en/latest/deployment/docker).

Then install the Haystack integration:

## Installation
vLLM should be installed.
- you can use `pip`: `pip install vllm` (more information in the [vLLM documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html))
- for production use cases, there are many other options, including Docker ([docs](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html))
```bash
pip install haystack-ai vllm
pip install vllm-haystack
```

## Components

This integration introduces the following components:

- [**VLLMChatGenerator**](https://docs.haystack.deepset.ai/docs/vllmchatgenerator): A component for chat completion using generative models served by vLLM. Supports streaming, tool calling, reasoning, and structured outputs.

- [**VLLMTextEmbedder**](https://docs.haystack.deepset.ai/docs/vllmtextembedder): A component for embedding a single string (e.g., a query) using an embedding model served by vLLM.

- [**VLLMDocumentEmbedder**](https://docs.haystack.deepset.ai/docs/vllmdocumentembedder): A component for embedding a list of `Document` objects using an embedding model served by vLLM.

- [**VLLMRanker**](https://docs.haystack.deepset.ai/docs/vllmranker): A component for reranking documents using a ranking model (cross-encoder or late interaction) served by vLLM.

## Usage
You first need to run an vLLM OpenAI-compatible server. You can do that using [Python](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server) or [Docker](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html).

Then, you can use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.
### Serving a model with vLLM

`vllm serve` launches an OpenAI-compatible server. For example, to serve a small generative model with reasoning and tool-calling enabled:

```bash
vllm serve "Qwen/Qwen3-0.6B" --port 8000 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser hermes
```

Embedding and ranking models are served the same way. Just point `vllm serve` at the relevant model (e.g., `sentence-transformers/all-MiniLM-L6-v2` or `BAAI/bge-reranker-base`).

### VLLMChatGenerator

```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = OpenAIChatGenerator(
api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"), # for compatibility with the OpenAI API, a placeholder api_key is needed
model="mistralai/Mistral-7B-Instruct-v0.1",
llm = VLLMChatGenerator(
model="Qwen/Qwen3-0.6B",
api_base_url="http://localhost:8000/v1",
generation_kwargs = {"max_tokens": 512}
generation_kwargs={"extra_body": {"chat_template_kwargs": {"enable_thinking": True}}},
)

response = generator.run(messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")])
response = llm.run(messages=[ChatMessage.from_user("Write Python code to reverse a string.")])
print(response["replies"][0].text)

# When reasoning is enabled, the reasoning trace is available separately:
print(response["replies"][0].reasoning)
```

`VLLMChatGenerator` also supports structured outputs via `response_format` and tool calling, making it a drop-in chat generator for Haystack `Agent` pipelines.

### VLLMTextEmbedder and VLLMDocumentEmbedder

Use the two embedders together to build a simple semantic retrieval pipeline:

```python
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.embedders.vllm import (
VLLMDocumentEmbedder,
VLLMTextEmbedder,
)

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

docs = [
Document(content="My name is Wolfgang and I live in Berlin"),
Document(content="My name is Luca and I live in Milan"),
Document(content="Germany has many big cities"),
Document(content="Italy is a country in Europe"),
]

document_embedder = VLLMDocumentEmbedder(
model="sentence-transformers/all-MiniLM-L6-v2",
api_base_url="http://localhost:8000/v1",
)
document_store.write_documents(document_embedder.run(docs)["documents"])

query_pipeline = Pipeline()
query_pipeline.add_component(
"text_embedder",
VLLMTextEmbedder(
model="sentence-transformers/all-MiniLM-L6-v2",
api_base_url="http://localhost:8000/v1",
),
)
query_pipeline.add_component(
"retriever",
InMemoryEmbeddingRetriever(document_store=document_store, top_k=2),
)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = query_pipeline.run({"text_embedder": {"text": "Who lives in Berlin?"}})
for doc in result["retriever"]["documents"]:
print(doc.score, doc.content)
# 0.668... My name is Wolfgang and I live in Berlin
# 0.602... Germany has many big cities
```

### VLLMRanker

Pair `VLLMRanker` with a fast first-stage retriever (e.g., BM25) to rerank candidates by relevance to the query:

```python
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.rankers.vllm import VLLMRanker

docs = [
Document(content="Paris is the capital city of France"),
Document(content="Lyon is a major city in France known for cuisine"),
Document(content="Toulouse is a large city in France known for aerospace"),
Document(content="Marseille is a port city in southern France"),
Document(content="France has a rich history and culture"),
Document(content="Berlin is the capital of Germany"),
Document(content="Madrid is the capital city of Spain"),
]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

retriever = InMemoryBM25Retriever(document_store=document_store, top_k=10)
ranker = VLLMRanker(
model="BAAI/bge-reranker-base",
api_base_url="http://localhost:8000/v1",
top_k=3,
)

pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
pipeline.add_component("ranker", ranker)
pipeline.connect("retriever.documents", "ranker.documents")

query = "france cities"
result = pipeline.run({"retriever": {"query": query}, "ranker": {"query": query}})
for doc in result["ranker"]["documents"]:
print(doc.score, doc.content)
# 0.986... Paris is the capital city of France
# 0.914... Lyon is a major city in France known for cuisine
# 0.858... Toulouse is a large city in France known for aerospace
```

## End-to-end example

For a complete walkthrough covering generative, embedding, and ranking models — including a tool-calling agent and a retrieval + reranking pipeline — see the [vLLM + Haystack cookbook notebook](https://haystack.deepset.ai/cookbook/vllm_inference_engine).

## License

`vllm-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
Binary file modified logos/vllm.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.