
embeddings.cpp

A C++ library for text (and maybe image) embeddings, focusing on efficient inference of BERT-like (and maybe CLIP-like) models.

Overview

Many existing GGML-based text embedding libraries have limited support for Chinese text processing due to their custom tokenizer implementations. This project addresses this limitation by leveraging Hugging Face's Rust tokenizer implementation, wrapped with a C++ API to ensure consistency with the Python transformers library while providing native performance.

While currently focused on BERT-like text embedding models, the project aims to support image embedding models in the future (Work in Progress).

Note: This is an experimental and educational project. It is not recommended for production use at this time.

Supported Models

The following models have been tested and verified:

  • BAAI/bge-m3
  • BAAI/bge-base-zh-v1.5
  • shibing624/text2vec-base-multilingual
  • Snowflake/snowflake-arctic-embed-m-v2.0
  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

The C++ implementation is checked against Python transformers CPU output. For models that Hugging Face text-embeddings-inference (TEI) also supports, the TEI engine with ONNX Runtime (ORT) can be included as a performance comparator. For repeatable correctness and performance runs, use scripts/model_bench.py and the shared benchmark protocol in benchmarks/STANDARD.md.
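As a rough illustration of the parity check (a minimal sketch, assuming bge-style CLS pooling and L2 normalization for the reference; scripts/model_bench.py remains the canonical harness):

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from embeddings_cpp import load

texts = ["hello world", "你好,世界"]

# Reference: transformers CPU forward with CLS pooling + L2 normalization
tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
ref_model = AutoModel.from_pretrained("BAAI/bge-m3")
with torch.no_grad():
    out = ref_model(**tok(texts, padding=True, return_tensors="pt"))
ref = torch.nn.functional.normalize(out.last_hidden_state[:, 0], dim=-1).numpy()

# Candidate: embeddings.cpp output via the Python package
cand = np.asarray(load("BAAI/bge-m3").batch_encode(texts))
cand = cand / np.linalg.norm(cand, axis=-1, keepdims=True)

print("min cosine:", (ref * cand).sum(axis=-1).min())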

Model Preparation

First, install the required dependencies:

uv pip install --torch-backend cpu -r scripts/requirements.txt

Then convert the models to GGUF format:

# Convert BGE-M3 model
uv run scripts/convert.py BAAI/bge-m3 ./models/bge-m3.fp16.gguf f16

# Convert BGE-Base Chinese v1.5 model
uv run scripts/convert.py BAAI/bge-base-zh-v1.5 ./models/bge-base-zh-v1.5.fp16.gguf f16

# Convert Snowflake Arctic Embed model
uv run scripts/convert.py Snowflake/snowflake-arctic-embed-m-v2.0 ./models/snowflake-arctic-embed-m-v2.0.fp16.gguf f16

# Convert Text2Vec multilingual model
uv run scripts/convert.py shibing624/text2vec-base-multilingual ./models/text2vec-base-multilingual.fp16.gguf f16

# Convert Paraphrase Multilingual MiniLM model
uv run scripts/convert.py sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 ./models/paraphrase-multilingual-MiniLM-L12-v2.fp16.gguf f16

Model Quantization

After converting models to GGUF format, you can quantize them to reduce memory usage and improve inference speed:

# Build the quantization tool
cmake --build build --target quantize

# Quantize a model (example with different quantization types)
./build/quantize ./models/bge-m3.fp16.gguf ./models/bge-m3.q4_k.gguf q4_k
./build/quantize ./models/bge-m3.fp16.gguf ./models/bge-m3.q6_k.gguf q6_k
./build/quantize ./models/bge-m3.fp16.gguf ./models/bge-m3.q8_0.gguf q8_0

# On Windows
.\build\Release\quantize.exe .\models\bge-m3.fp16.gguf .\models\bge-m3.q4_k.gguf q4_k

Supported Quantization Types

  • q4_k: 4-bit k-quant (block-wise quantization with super-blocks; good balance of size and quality)
  • q6_k: 6-bit k-quant (higher quality, larger size)
  • q8_0: 8-bit block quantization (minimal quality loss, moderate size reduction)
  • Other GGML quantization types as supported by the library

Usage

quantize <input_model.gguf> <output_model.gguf> <qtype>

The quantization tool will:

  1. Load the input GGUF model
  2. Quantize eligible tensors (typically weight matrices)
  3. Preserve metadata and non-quantizable tensors
  4. Output size comparison and compression statistics
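A hedged sketch of the per-tensor decision behind steps 2 and 3 (the helper name and the exact rule are illustrative assumptions, not this repo's code):

def plan_tensor(name: str, n_dims: int, qtype: str) -> str:
    # Only 2D weight matrices are eligible for quantization (step 2);
    # 1D tensors such as layer norms and biases keep full precision (step 3).
    if n_dims != 2:
        return "f32"
    return qtype

Mixed schemes like q4_k_mlp_q8_attn refine this rule by choosing different quantization types for different tensor groups, as described in the benchmarking section below.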

Python Package Usage

Install embeddings.cpp:

pip install "embeddings-cpp[hub]"

Load published GGUF artifacts directly from Hugging Face:

from embeddings_cpp import load

model = load("BAAI/bge-m3")
vectors = model.batch_encode(["hello world", "你好,世界"])

model = load("Snowflake/snowflake-arctic-embed-m-v2.0")
vectors = model.batch_encode(["hello world", "你好,世界"])

For machine-specific local builds, GGML_NATIVE can be enabled explicitly:

EMBEDDINGS_CPP_NATIVE=1 pip install --no-binary embeddings-cpp embeddings-cpp

PyPI wheels keep GGML_NATIVE=OFF so they run on a broad range of CPUs.

Running From Source

Before running the source tests, install embeddings.cpp from the checkout:

# Optional: pass extra CMake settings via CMAKE_ARGS
# (PowerShell syntax; on Unix shells: export CMAKE_ARGS="-DGGML_VULKAN=ON")
$env:CMAKE_ARGS="-DGGML_VULKAN=ON"

# Install the package
pip install .

# Generate Python stub files
cd build && make stub

# On Windows, generate the stubs with pybind11-stubgen instead
pip install pybind11-stubgen
pybind11-stubgen embeddings_cpp -o .

Then run a test directly, for example:

python tests/test_tokenizer.py

Alignment and Benchmarking

Run registry-driven correctness and benchmark checks for all supported registry models:

uv run scripts/model_bench.py --all-models --convert-missing

Use --models-file for a stable subset:

uv run scripts/model_bench.py --models-file scripts/registry_models.txt --convert-missing

For model benchmark work, use the registry-driven unified runner. It uses the shared benchmark protocol and structured model config, and currently focuses on Python CPU, TEI engine ORT, and embeddings.cpp:

uv run scripts/model_bench.py \
  --models BAAI/bge-m3 \
  --runners python_cpu embeddings_cpp tei_engine_ort \
  --quantizations q8_0 \
  --batch-sizes 1 4 8

tei_engine_ort requires a local text-embeddings-inference checkout at ../text-embeddings-inference or an explicit --tei-repo-dir.

For focused BGE-M3 single-request and batch validation without TEI:

uv run scripts/model_bench.py \
  --models BAAI/bge-m3 \
  --runners python_cpu embeddings_cpp \
  --convert-missing \
  --batch-sizes 1 4 8

By default the cosine thresholds are taken from the model registry and are used as report tolerances, not process-failure gates. To explore looser product tolerances, pass them explicitly:

uv run scripts/model_bench.py \
  --models BAAI/bge-m3 \
  --runners python_cpu embeddings_cpp \
  --batch-sizes 1 4 8 \
  --min-cos 0.95 \
  --batch-min-cos 0.95 \
  --quantized-batch-min-cos 0.95

For CI-style checks that should fail on tolerance misses, add --fail-on-threshold.

To produce a Snowflake-style BGE-M3 optimization table with Python CPU as the correctness and speed baseline, sweep k-quant variants and CPU repack modes:

cmake --build build --target quantize
uv run scripts/model_bench.py \
  --models BAAI/bge-m3 \
  --runners python_cpu embeddings_cpp \
  --convert-missing \
  --quantize-missing \
  --quantizations fp16 q8_0 q6_k q4_k \
  --repack-modes off on \
  --batch-sizes 1 4 8

The generated model_bench_*.md report includes correctness, raw performance, optimization-sweep, and best-variant-by-batch tables under scripts/output/. Stable benchmark summaries are kept under benchmarks/ and linked from the README instead of copying every model's full benchmark table inline. New model reports should follow the shared benchmarks/STANDARD.md protocol. The current BGE-M3 report is benchmarks/bge-m3.md.

The benchmark report compares Python transformers CPU, embeddings.cpp, and TEI when enabled for the model. For Snowflake on CPU, the only cross-implementation format all three runners share is fp32, so the README keeps the fair cross-runner table in fp32 and moves embeddings.cpp quantization results to a separate trade-off table.

Measured on this PC:

  • CPU: Intel Xeon E5-2673 v3 @ 2.40GHz
  • Cores: 12 vCPU, 1 socket, SMT off
  • Memory: 62 GiB RAM
  • OS: Ubuntu (Linux kernel 5.15)
  • Model: Snowflake/snowflake-arctic-embed-m-v2.0
  • Fair baseline GGUF: models/snowflake-arctic-embed-m-v2.0.fp32.gguf
  • Production GGUF: models/snowflake-arctic-embed-m-v2.0.q4_k_mlp_q8_attn.gguf

Fair cross-runner baseline, threads=12, batch=8, serial runs, scope=end_to_end. This table keeps all three runners in fp32 where the comparison is like-for-like and uses the same realistic randomized text pool as the production table below:

| Runner | Format | Mean ms | P50 ms | P95 ms | Text/s | RSS MB |
| --- | --- | --- | --- | --- | --- | --- |
| python_cpu | HF fp32 | 144.25 | 97.91 | 354.04 | 55.46 | 1148.9 |
| tei_engine | ORT fp32 | 102.62 | 92.49 | 117.58 | 77.96 | 1972.6 |
| embeddings.cpp | GGUF fp32 | 249.68 | 233.98 | 311.03 | 32.04 | 1534.6 |

For this Snowflake CPU path, the TEI row above is mostly an ONNX Runtime result, not a router result. TEI's ORT backend applies graph optimization on this model. A future Candle-only TEI row would be a different backend and should be reported as a separate line, not mixed into the main fp32 baseline.

Production comparator, threads=12, batch=8, warmup=3, iterations=12, serial runs, scope=end_to_end. This table uses scripts/profile_snowflake.py with its realistic randomized text pool and includes tokenization on every runner. For Snowflake on this host, p50/p95 and RSS are the primary numbers; mean can be distorted by host jitter.

| Runner | Backend / Format | Mean ms | P50 ms | P95 ms | Text/s | RSS MB | Accuracy vs Python fp32 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| python_cpu | HF fp32 | 91.01 | 84.91 | 110.71 | 87.90 | 1157.0 | reference |
| tei_engine_ort | ORT fp32 | 96.89 | 97.89 | 107.98 | 82.57 | 1965.3 | reference-level fp32 |
| embeddings.cpp | GGUF q4_k_mlp_q8_attn | 90.02 | 92.43 | 94.23 | 88.87 | 543.1 | min cos 0.991448 |

Snowflake CPU benchmark

With this production quantization, embeddings.cpp is in the same end-to-end latency tier as Python CPU and TEI ORT on this host while using much less RSS. Compared with TEI ORT, resident memory is about 3.6x lower. Compared with Python CPU, resident memory is about 2.1x lower.

The mixed quantization keeps attention weights at q8_0 and quantizes MLP weights to q4_K. That is intentional: attention is more accuracy-sensitive on this model, while the MLP dominates model size and benefits most from lower-bit weights.
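A minimal sketch of how such a mixed scheme can be expressed (the name patterns here are assumptions for illustration; real tensor names depend on the GGUF conversion):

import re

def mixed_qtype(tensor_name: str, n_dims: int) -> str:
    # q4_k_mlp_q8_attn: accuracy-sensitive attention stays at q8_0,
    # size-dominant MLP weights drop to q4_k, 1D tensors stay untouched.
    if n_dims != 2:
        return "f32"
    if re.search(r"attn|attention", tensor_name):
        return "q8_0"
    if re.search(r"mlp|ffn|intermediate", tensor_name):
        return "q4_k"
    return "q8_0"  # conservative fallback for remaining matrices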

embeddings.cpp quantization trade-offs on the same machine, same threads=12, same batch=8. These rows come from isolated per-quant sweeps:

| Quant | Size MB | Mean ms | Text/s | RSS MB | Worst Min Cos | Batch Min Cos |
| --- | --- | --- | --- | --- | --- | --- |
| fp16 | 591.3 | 128.87 | 62.08 | 888.3 | 0.999985 | 1.000000 |
| q8_0 | 318.3 | 95.77 | 83.53 | 615.2 | 0.998978 | 1.000000 |
| q6_k | 247.8 | 98.51 | 81.21 | 545.0 | 0.992130 | 1.000000 |
| q5_k | 209.2 | 109.29 | 73.20 | 506.1 | 0.983449 | 1.000000 |
| q5_0 | 209.2 | 139.20 | 57.47 | 506.2 | 0.984581 | 1.000000 |
| q4_0_attnf16 | 211.6 | 67.87 | 117.88 | 508.4 | 0.983455 | 1.000000 |
| q4_0_mlp_q5_0_attn | 176.1 | 77.90 | 102.69 | 473.0 | 0.967307 | 1.000000 |
| q4_0_mlp_q6_k_attn | 179.7 | 69.51 | 115.09 | 476.6 | 0.978769 | 1.000000 |
| q4_0_mlp_q8_attn | 186.3 | 61.03 | 131.09 | 483.4 | 0.981470 | 1.000000 |
| q4_k_mlp_attnf16 | 211.6 | 63.58 | 125.83 | 508.6 | 0.991325 | 0.999550 |
| q4_k_mlp_q8_attn | 186.3 | 58.86 | 135.91 | 483.3 | 0.991226 | 0.999321 |
| q4_0 | 172.8 | 62.17 | 128.67 | 469.8 | 0.948146 | 1.000000 |
| q4_k | 172.8 | 47.41 | 168.74 | 469.7 | 0.936614 | 0.994469 |
| q4_0_embf16 | 436.0 | 52.28 | 153.02 | 732.9 | 0.946242 | 1.000000 |
| q4_0_mlpf16 | 289.2 | 158.27 | 50.55 | 586.4 | 0.956237 | 1.000000 |
| q4_0_mlp_q4_k_attn | 172.8 | 50.50 | 158.41 | 469.7 | 0.944264 | 0.997765 |

Observed on this CPU:

  • q8_0 is the conservative compression point: much lower RSS than fp32, with very small output drift.
  • q6_k is the smallest config that still keeps Worst Min Cos above 0.99 in this Snowflake suite.
  • q4_k_mlp_q8_attn is the current production default because it stays close to Python CPU and TEI ORT end-to-end latency while cutting RSS sharply.
  • q4_k, q4_0, q4_0_embf16, and q4_0_mlp_q4_k_attn are fast, but the output drift is large enough that they are not the default recommendation for correctness-sensitive use.
  • q4_0_mlpf16 is neither fast nor especially accurate on this model, so it is not an attractive point in the trade-off space.

For Snowflake/GTE on CPU, embeddings.cpp now enables flash attention and CPU repack by default; set EMBEDDINGS_CPP_FLASH_ATTN=0 or EMBEDDINGS_CPP_CPU_REPACK=0 only when debugging or checking regressions.
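For example, from Python (a sketch; it assumes both variables are read when the model is loaded):

import os

# Debug/regression-check only: disable the default CPU fast paths
os.environ["EMBEDDINGS_CPP_FLASH_ATTN"] = "0"
os.environ["EMBEDDINGS_CPP_CPU_REPACK"] = "0"

from embeddings_cpp import load
model = load("Snowflake/snowflake-arctic-embed-m-v2.0")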

Standalone benchmark runs also write JSON and Markdown reports under scripts/output/:

uv run scripts/benchmark.py \
  --model-id Snowflake/snowflake-arctic-embed-m-v2.0 \
  --gguf-path models/snowflake-arctic-embed-m-v2.0.q4_k_mlp_q8_attn.gguf

Pin the C++ CPU thread count while tuning:

uv run scripts/model_bench.py --all-models --cpp-threads 8

For models also supported by text-embeddings-inference, include TEI engine ORT as an additional performance comparator:

uv run scripts/model_bench.py \
  --models Snowflake/snowflake-arctic-embed-m-v2.0 \
  --runners python_cpu embeddings_cpp tei_engine_ort

For registry-driven Snowflake checks against the optimized mixed GGUF:

uv run scripts/correctness.py --model-id Snowflake/snowflake-arctic-embed-m-v2.0 --benchmark

Loading Published GGUF Models

Known optimized GGUF artifacts are listed in embeddings_cpp/registry.json. The default Snowflake artifact is published under the chux0519 Hugging Face namespace.

  • Model repository: https://huggingface.co/chux0519/snowflake-arctic-embed-m-v2.0-gguf-embeddings-cpp
  • Direct GGUF file: https://huggingface.co/chux0519/snowflake-arctic-embed-m-v2.0-gguf-embeddings-cpp/resolve/main/snowflake-arctic-embed-m-v2.0.q4_k_mlp_q8_attn.gguf

Load it through the registry-aware loader:

from embeddings_cpp import load

model = load("Snowflake/snowflake-arctic-embed-m-v2.0")
vectors = model.batch_encode(["hello world", "你好,世界"])

By default, CPU inference uses the detected CPU concurrency. Pin EMBEDDINGS_CPP_THREADS=N only after measuring a specific host or container CPU quota.
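For example, a quick sweep to find a reasonable pin (a sketch; it assumes EMBEDDINGS_CPP_THREADS is read each time the model is loaded):

import os
import time
from embeddings_cpp import load

texts = ["hello world"] * 8
for n in (4, 8, 12):
    os.environ["EMBEDDINGS_CPP_THREADS"] = str(n)
    model = load("Snowflake/snowflake-arctic-embed-m-v2.0")
    t0 = time.perf_counter()
    model.batch_encode(texts)
    print(n, "threads:", round((time.perf_counter() - t0) * 1000, 2), "ms")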

Install the optional Hugging Face dependency when downloading from the Hub:

pip install "embeddings-cpp[hub]"

Browser Runtime

The Snowflake production artifact also runs in Chromium: browser WASM and WebGPU builds are available for the Snowflake GGUF. The npm-facing browser package currently defaults to stable single-thread WASM; WebGPU is experimental. The WebGPU backend now includes dedicated kernels for several Snowflake-specific GTE ops, but it still needs broader operator coverage and browser-specific tuning before it should replace the default WASM path. The older engine-only browser benchmark below is useful for backend tracking, but it excludes tokenizer/package overhead and should not be read as the npm package default.

Platform for the browser numbers below:

  • Host: Mac mini Mac16,10
  • CPU: Apple M4
  • Memory: 16 GiB
  • OS: macOS 26.3.1
  • Browser: Google Chrome
  • Model: models/snowflake-arctic-embed-m-v2.0.q4_k_mlp_q8_attn.gguf
  • Scope: browser engine-only forward, tokenizer excluded

| Scenario | WASM single-thread ms | WASM pthread x8 ms | WebGPU ms | WebGPU speedup vs pthread |
| --- | --- | --- | --- | --- |
| batch=1, short sentence | 165.24 | 56.24 | 35.20 | 1.60x |
| batch=8, mixed multilingual batch | 1298.39 | 342.91 | 51.31 | 6.68x |
| batch=8, short question set | 1458.59 | 390.97 | 50.85 | 7.69x |

Browser benchmark on Apple M4 Chrome

The static demo is at demo/browser-wasm/index.html and supports both preload-based bundles and a dynamic GGUF download mode. In download mode, the page can fetch the published Snowflake GGUF, cache the browser runtime bundle plus the model bytes in Cache Storage, and reuse them across reloads. The full method and detailed numbers are in docs/BROWSER_BENCHMARK.md.

For browser correctness, the text-to-vector demo is checked against local embeddings.cpp output with three default cases: Chinese, English, and mixed-language input.

python3 scripts/browser_wasm_bench_server.py --host 127.0.0.1 --port 18081 --root "$PWD"
python3 scripts/browser_e2e_compare.py --base-url http://127.0.0.1:18081

HTTP Server

The server can load a registered model from Hugging Face or a local GGUF path. For a Snowflake deployment, embeddings.cpp is intended to replace a TEI CPU setup.

TEI Comparison

For Snowflake/snowflake-arctic-embed-m-v2.0, the deployment mapping is:

| Concern | TEI | embeddings.cpp |
| --- | --- | --- |
| Container image | ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 | ghcr.io/<owner>/embeddings-cpp-server:<tag> or a locally built image |
| Model source | Hugging Face model repo | Registered optimized GGUF from chux0519/snowflake-arctic-embed-m-v2.0-gguf-embeddings-cpp or --gguf-path |
| Main request path | POST /embed | POST /embed |
| OpenAI-style path | not the primary TEI path | POST /v1/embeddings |
| Batch token guard | --max-batch-tokens | --max-batch-tokens |
| Thread control | TEI runtime defaults | detected CPU concurrency by default; override with --threads or EMBEDDINGS_CPP_THREADS only after measurement |
| Health probes | /health | /health, /ready, /info |

The TEI command for Snowflake:

mkdir -p .cache/tei

docker run --rm -p 8081:80 \
  -v "$PWD/.cache/tei:/data" \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
  --model-id Snowflake/snowflake-arctic-embed-m-v2.0 \
  --max-batch-tokens 8192

Reusing .cache/tei avoids downloading the same TEI model artifacts again on every benchmark run.

The equivalent embeddings.cpp server run is:

python -m embeddings_cpp.server \
  --model-id Snowflake/snowflake-arctic-embed-m-v2.0 \
  --port 8080 \
  --max-batch-tokens 8192

Build and run the Docker image locally:

docker build -t embeddings-cpp-server:local .

docker run --rm -p 8080:80 \
  embeddings-cpp-server:local \
  --model-id Snowflake/snowflake-arctic-embed-m-v2.0 \
  --max-batch-tokens 8192

Endpoints:

  • GET /health
  • GET /ready
  • GET /info
  • POST /embed with {"inputs": ["hello", "world"]}
  • POST /v1/embeddings with an OpenAI-compatible embeddings request

For client compatibility, the main request surfaces are:

  • TEI: POST /embed
  • embeddings.cpp: POST /embed
  • OpenAI-style clients: POST /v1/embeddings
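A minimal client sketch against a local embeddings.cpp server on port 8080 (the /v1/embeddings request and response assume the usual OpenAI shapes; a single-model server may ignore the model field):

import requests

# TEI-compatible path
resp = requests.post("http://127.0.0.1:8080/embed",
                     json={"inputs": ["hello", "world"]})
vectors = resp.json()

# OpenAI-compatible path
resp = requests.post("http://127.0.0.1:8080/v1/embeddings",
                     json={"model": "snowflake-arctic-embed-m-v2.0",
                           "input": ["hello", "world"]})
embedding = resp.json()["data"][0]["embedding"]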

For correctness work, the Snowflake path is checked against Python transformers CPU output and optionally TEI engine ORT. See docs/TEST_MATRIX.md and scripts/server_compare.py. For performance work, scripts/model_bench.py reports inference speed and RSS memory.

Container images can be published to GHCR with .github/workflows/publish-server-image.yml, which publishes tags in the form ghcr.io/<owner>/embeddings-cpp-server:<tag>.

Building from Source

macOS (ARM)

Configure and build with Metal support:

# From a build directory:
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
      -DGGML_METAL=ON \
      -DGGML_METAL_EMBED_LIBRARY=ON \
      -DEMBEDDINGS_CPP_ENABLE_PYBIND=ON ..
cmake --build .

If you run into OpenMP build errors, try:

brew install libomp

export OpenMP_ROOT=$(brew --prefix)/opt/libomp

Windows

Build with Vulkan support:

cmake -DGGML_VULKAN=ON -DEMBEDDINGS_CPP_ENABLE_PYBIND=ON ..
# If you encounter issues, ensure that your graphics driver and Vulkan SDK versions are compatible.
# You can also add -DGGML_VULKAN_DEBUG=ON -DGGML_VULKAN_VALIDATE=ON for debugging.

Debugging

GGML debug support is now enabled by default in the vendored version. This provides better debugging capabilities for CPU backend operations without requiring additional patches.

For more information about GGML debugging features, see: ggml-org/ggml#655
