Honcho Embed Reverse Proxy is a lightweight HTTP reverse proxy that intercepts OpenAI-compatible /v1/embeddings API requests. It automatically rewrites the model name and adds a dimensions parameter (default: 1536) before forwarding requests to the backend embedding server.
This proxy implements the workaround described in plastic-labs/honcho#404 for running fully local Honcho deployments with custom embedding models.
Honcho's embedding configuration has several hardcoded constraints:

- Only supports the `openai`, `gemini`, or `openrouter` providers for custom base URLs
- When using the `openrouter` provider, the model name is hardcoded to `openai/text-embedding-3-large`
- The database schema and code expect exactly 1536-dimensional embeddings
This reverse proxy sits between Honcho and your embedding server (e.g., vLLM), performing these transformations:

- **Model name rewriting**: Honcho requests `openai/text-embedding-3-large`; the proxy rewrites it to your actual model (e.g., `Qwen/Qwen3-Embedding-4B`)
- **Dimensions injection**: Adds `dimensions: 1536` to all requests (configurable; default 1536 for Honcho compatibility)
- **Response model fixing**: Rewrites the model name back in responses so Honcho sees the model it expects

This allows you to use any embedding model with Honcho without modifying vLLM's `--served-model-name` or Honcho's source code.
This proxy's primary purpose is to:

- Intercept `/v1/embeddings` requests from clients
- Rewrite the model name from the client-facing model name to the actual backend model name
- Add a `dimensions` parameter set to 1536 to all embedding requests
- Restore the original model name in the response before sending it back to the client
- Pass through all other requests unchanged to the backend
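The request-side rewrite described above can be sketched in a few lines of Go. This is an illustrative sketch, not the proxy's actual source: `rewriteEmbeddingRequest` is a hypothetical helper name, and the field names follow the OpenAI embeddings API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rewriteEmbeddingRequest swaps the client-facing model name for the backend
// model and injects the dimensions parameter. It returns the rewritten body
// plus the original model name, so the response can be restored later.
func rewriteEmbeddingRequest(body []byte, servedModel string, dims int) ([]byte, string, error) {
	var req map[string]any
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, "", err
	}
	origModel, _ := req["model"].(string)
	req["model"] = servedModel
	req["dimensions"] = dims
	out, err := json.Marshal(req)
	return out, origModel, err
}

func main() {
	in := []byte(`{"model":"openai/text-embedding-3-large","input":"Hello, world!"}`)
	out, orig, err := rewriteEmbeddingRequest(in, "Qwen/Qwen3-Embedding-4B", 1536)
	if err != nil {
		panic(err)
	}
	fmt.Println(orig)        // original model name, kept for the response
	fmt.Println(string(out)) // rewritten body with dimensions injected
}
```

Keeping the original model name alongside the rewritten body is the key design point: it is the only state needed to make the response look untouched to Honcho.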
Requirements: Go 1.24.2 or later

```sh
go build -o honcho-embed-rp ../honcho-embed-rp

./honcho-embed-rp \
  -target "http://127.0.0.1:8000" \
  -served-model "Qwen/Qwen3-Embedding-4B"
```

Or using environment variables:

```sh
export HONCHOEMBEDRP_TARGET="http://127.0.0.1:8000"
export HONCHOEMBEDRP_SERVED_MODEL_NAME="Qwen/Qwen3-Embedding-4B"
./honcho-embed-rp
```

Configure the proxy using command-line flags or environment variables:
| Flag | Environment Variable | Default | Description |
|---|---|---|---|
| `-listen` | `HONCHOEMBEDRP_LISTEN` | `0.0.0.0` | IP address to listen on |
| `-port` | `HONCHOEMBEDRP_PORT` | `9000` | Port to listen on |
| `-target` | `HONCHOEMBEDRP_TARGET` | `http://127.0.0.1:8000` | Backend target URL |
| `-loglevel` | `HONCHOEMBEDRP_LOGLEVEL` | `INFO` | Log level (COMPLETE, DEBUG, INFO, WARN, ERROR) |
| `-served-model` | `HONCHOEMBEDRP_SERVED_MODEL_NAME` | (required) | Backend model name to use in outgoing requests |
| `-dimensions` | `HONCHOEMBEDRP_DIMENSIONS` | `1536` | Embedding dimensions (1536 for Honcho compatibility) |
- `POST /v1/embeddings`: Transformed (model name rewritten, `dimensions=1536` added)
- `GET /health`: Health check endpoint (returns `{"status":"healthy"}`)
- All other paths: Passed through unchanged to the backend
Client Request (Honcho sends this):

```
POST /v1/embeddings
{
  "model": "openai/text-embedding-3-large",
  "input": "Hello, world!"
}
```

Backend Request (after transformation):

```
POST /v1/embeddings
{
  "model": "Qwen/Qwen3-Embedding-4B",
  "input": "Hello, world!",
  "dimensions": 1536
}
```

Client Response (model name restored for Honcho):

```
{
  "object": "list",
  "data": [...],
  "model": "openai/text-embedding-3-large",
  "usage": {...}
}
```

vLLM embedding server (Docker Compose):
```yaml
services:
  vllm-embedding:
    image: vllm/vllm-openai:latest
    command:
      - Qwen/Qwen3-Embedding-4B
      - --port
      - "8000"
      - --gpu-memory-utilization
      - "0.5"
      - --hf-overrides
      - '{"is_matryoshka": true, "matryoshka_dimensions": [1536]}'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

honcho-embed-rp (Docker Compose):
```yaml
services:
  honcho-embed-rp:
    image: honcho-embed-rp:latest
    environment:
      - HONCHOEMBEDRP_TARGET=http://vllm-embedding:8000
      - HONCHOEMBEDRP_SERVED_MODEL_NAME=Qwen/Qwen3-Embedding-4B
      - HONCHOEMBEDRP_DIMENSIONS=1536
    ports:
      - "9000:9000"
```

Honcho environment variables:
```sh
# Point OpenAI-compatible provider to the proxy
LLM_OPENAI_COMPATIBLE_BASE_URL=http://honcho-embed-rp:9000/v1
LLM_OPENAI_COMPATIBLE_API_KEY=sk-no-key-required

# Use openrouter provider (supports custom base URL)
# Honcho will request model: openai/text-embedding-3-large
LLM_EMBEDDING_PROVIDER=openrouter

# vLLM provider for LLM calls (separate endpoint)
DERIVER_PROVIDER=vllm
DERIVER_MODEL="your-llm-model-name"
LLM_VLLM_BASE_URL=http://your-vllm-llm:8000/v1
LLM_VLLM_API_KEY=your-api-key
```

With this setup:
- Honcho requests embeddings from `openai/text-embedding-3-large` at the proxy (hardcoded by the openrouter provider)
- The proxy rewrites the request to `Qwen/Qwen3-Embedding-4B` with `dimensions: 1536`
- vLLM serves the Qwen model with Matryoshka embeddings at 1536 dimensions
- The response model name is rewritten back to `openai/text-embedding-3-large` for Honcho compatibility
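The last step in that flow, restoring the model name, mirrors the request-side rewrite. An illustrative sketch with a hypothetical helper name (the real proxy may patch the response body differently):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// restoreResponseModel rewrites the backend's model name in an embeddings
// response back to the name the client originally sent, so Honcho sees
// exactly the model it asked for.
func restoreResponseModel(body []byte, origModel string) ([]byte, error) {
	var resp map[string]any
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	resp["model"] = origModel
	return json.Marshal(resp)
}

func main() {
	backend := []byte(`{"object":"list","model":"Qwen/Qwen3-Embedding-4B"}`)
	out, err := restoreResponseModel(backend, "openai/text-embedding-3-large")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"model":"openai/text-embedding-3-large","object":"list"}
}
```

Only the `model` field is touched; the embedding vectors and usage data pass through untouched.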
`GET /health`: Returns `{"status":"healthy"}` for Docker health checks
The proxy supports the following log levels:
| Level | Description |
|---|---|
| `COMPLETE` | Most verbose; includes full HTTP request/response dumps |
| `DEBUG` | Debug information including parameter application details |
| `INFO` | General operational information |
| `WARN` | Warning messages |
| `ERROR` | Error messages only |
When set to `COMPLETE`, the proxy logs full HTTP request and response bodies, which is useful for debugging but very verbose.

Because embedding requests carry the raw input text, the `COMPLETE` log level exposes that data in plaintext. Only enable it in secure, non-production environments, or ensure logs are properly secured and retained only temporarily.
MIT License - see LICENSE file for details.