LLM Inference Engine Benchmarking

This repository contains the code to reproduce the experiments evaluating and comparing various LLM inference engines (vLLM, LMDeploy, SGLang, llama.cpp, Ollama, and Hugging Face Transformers), as well as the scripts and data for our ecosystem surveys.

Prerequisites

Docker and Docker Compose.
An NVIDIA GPU with drivers that support CUDA 12+, plus the NVIDIA Container Toolkit.
A Hugging Face token (for downloading restricted models/datasets).
An OpenAI API key (for LLM-as-a-judge benchmarks such as SimpleQA and JailbreakBench).

1. Environment Setup

Create a .env file in the project root (or export these variables in your shell):

HF_TOKEN=your_huggingface_token
OPENAI_API_KEY=your_openai_api_key
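
Before starting the containers, it can help to confirm that both variables are actually set. The following is a minimal sketch and not part of the repository: it assumes a plain KEY=value .env file in the current directory, and the helper names parse_env_file and missing_keys are hypothetical.

```python
import os
from pathlib import Path

# The two variables named in this README.
REQUIRED_KEYS = ("HF_TOKEN", "OPENAI_API_KEY")


def parse_env_file(path):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env_path=".env", environ=None):
    """Return the required keys set neither in the .env file nor in the environment."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(env_path) if Path(env_path).is_file() else {}
    return [k for k in REQUIRED_KEYS if not (file_values.get(k) or environ.get(k))]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```

The check only verifies that a value is present; it does not validate the token or key against either service.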

2. Download Models & Datasets

We provide a Dockerized setup script to automatically download the necessary Hugging Face datasets and model weights, and convert them to the required formats (like GGUF).

Run the setup container (this may take a while depending on your bandwidth):

docker compose up setup

(Optional) If you are running the Ollama experiments, you also need to initialize the Ollama models:

docker compose up prepare_ollama

3. Running the Main Experiments

The primary entry point for the benchmarks is main.py, which is executed inside the comparison_experiment container.

You can run an experiment by passing a model config and a benchmark config. For example, to evaluate Llama-3 8B on GSM8K using vLLM:

docker compose run --rm comparison_experiment python main.py \
    --model_config configs/model/llama3_8B_vllm.yaml \
    --benchmark_config configs/benchmark/gsm8k.yaml \
    --seed 42 \
    --output_dir ./results

Arguments:

--model_config: Path to the engine/model YAML config (see configs/model/).
--benchmark_config: Path to the benchmark YAML config (see configs/benchmark/).
--seed: Random seed for reproducibility.
--output_dir: Where to save the generated JSON results.
--run_mode: full (generate and evaluate) or generate_only.
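
To run several model/benchmark/seed combinations, the command above can be generated in a loop. The sketch below is an illustration under stated assumptions, not a script shipped with the repository: build_command is a hypothetical helper, the seeds 43 and 44 are arbitrary, and it only prints each command (a dry run) instead of executing it.

```python
import shlex


def build_command(model_config, benchmark_config, seed=42,
                  output_dir="./results", run_mode=None):
    """Assemble the documented `docker compose run` invocation as an argument list."""
    cmd = [
        "docker", "compose", "run", "--rm", "comparison_experiment",
        "python", "main.py",
        "--model_config", model_config,
        "--benchmark_config", benchmark_config,
        "--seed", str(seed),
        "--output_dir", output_dir,
    ]
    # --run_mode is only appended when requested; otherwise main.py's default applies.
    if run_mode is not None:
        cmd += ["--run_mode", run_mode]
    return cmd


if __name__ == "__main__":
    model_configs = ["configs/model/llama3_8B_vllm.yaml"]
    benchmark_configs = ["configs/benchmark/gsm8k.yaml"]
    for model in model_configs:
        for benchmark in benchmark_configs:
            for seed in (42, 43, 44):  # illustrative seeds
                print(shlex.join(build_command(model, benchmark, seed=seed)))
```

To execute the commands rather than print them, each list can be passed to subprocess.run(cmd, check=True).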

4. Layer Tracking & Engine Comparison (Optional)

To run the internal layer divergence tracking experiment, you must first instrument the inference engines by cloning their source code and applying our tracking modifications.

Step 4.1: Clone the Engine Repositories

Run the following from the root of this project to clone each engine and check out the pinned version on a new branch:

# vLLM
git clone https://github.com/vllm-project/vllm.git ./vllm_source
cd vllm_source && git checkout v0.10.2 -b feat/layer-tracking && cd ..

# Transformers
git clone https://github.com/huggingface/transformers.git ./transformers_source
cd transformers_source && git checkout v4.57.0 -b feat/layer-tracking && cd ..

# SGLang
git clone https://github.com/sgl-project/sglang.git ./sglang_source
cd sglang_source && git checkout v0.5.2 -b feat/layer-tracking && cd ..

# LMDeploy
git clone https://github.com/InternLM/lmdeploy.git ./lmdeploy_source
cd lmdeploy_source && git checkout v0.10.1 -b feat/layer-tracking && cd ..

# llama-cpp-python
git clone https://github.com/abetlen/llama-cpp-python.git ./llama_cpp_python_source
cd llama_cpp_python_source && git checkout v0.3.16 -b feat/layer-tracking
git submodule update --init --recursive
cd ..

# Ollama
git clone https://github.com/ollama/ollama.git ./ollama_source
cd ollama_source && git checkout v0.13.5 -b feat/layer-tracking && cd ..

Step 4.2: Apply Modified Source Files

Copy the files in the included modified_source directory into the corresponding cloned repositories. This overwrites the original files with our instrumented versions, which dump the hidden states.
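
The copy step can be scripted. The sketch below rests on an assumption this README does not confirm: that modified_source contains one subdirectory per engine whose internal layout mirrors the corresponding clone. The subdirectory names in OVERLAYS and the helper apply_overlay are hypothetical; adjust the mapping to the actual layout.

```python
import shutil
from pathlib import Path

# ASSUMPTION: keys are hypothetical subdirectory names under modified_source/.
# Values are the clone directories created in Step 4.1.
OVERLAYS = {
    "vllm": "vllm_source",
    "transformers": "transformers_source",
    "sglang": "sglang_source",
    "lmdeploy": "lmdeploy_source",
    "llama_cpp_python": "llama_cpp_python_source",
    "ollama": "ollama_source",
}


def apply_overlay(src_dir, dest_dir):
    """Copy every file under src_dir onto dest_dir, overwriting files that already exist."""
    src, dest = Path(src_dir), Path(dest_dir)
    if not dest.is_dir():
        raise FileNotFoundError(f"clone not found: {dest}")
    # dirs_exist_ok=True (Python 3.8+) merges into the existing tree.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return sorted(p.relative_to(src).as_posix() for p in src.rglob("*") if p.is_file())


if __name__ == "__main__":
    root = Path("modified_source")
    for sub, clone in OVERLAYS.items():
        if (root / sub).is_dir():
            copied = apply_overlay(root / sub, clone)
            print(f"{clone}: {len(copied)} file(s) copied")
        else:
            print(f"{clone}: skipped, {root / sub} not found")
```

Running git status inside each clone afterwards shows which files were changed.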

Step 4.3: Build Custom Ollama Archive

Because Ollama is distributed as a Go binary, we need to pre-build our modified version into a tarball so the Docker container can install it.

cd ollama_source

DOCKER_BUILDKIT=1 docker buildx build \
  --platform linux/amd64 \
  --target archive \
  --output type=local,dest=custom-dist \
  .

cd custom-dist
tar -czvf ../../custom-ollama-linux-amd64.tgz .

# Return to project root
cd ../../

Step 4.4: Rebuild and Run the Tracking Container

Now build the tracking Docker image (which installs the local, modified Python packages) and run the generation script:

# Build the tracking image
docker compose build layer_tracking

# Run the tracking script to generate intermediate tensors
docker compose up layer_tracking

Step 4.5: Compare Engines

Once the tensors are generated, you can compare the outputs of a target engine against the Hugging Face baseline locally:

python compare_engines.py \
    --ref ./tensors/hf_llama_vllm_comp \
    --target ./tensors/vllm_llama \
    --target-engine vllm
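
This README does not describe which metrics compare_engines.py reports or how the tensors are stored on disk. As a rough illustration of what a layer-by-layer comparison can look like, the sketch below treats each layer's hidden state as a flat list of floats and computes two common measures. All function names and the tolerance are assumptions, and pure Python is used only to keep the example dependency-free.

```python
import math


def max_abs_diff(ref, target):
    """Largest element-wise absolute difference between two equal-length vectors."""
    return max(abs(r - t) for r, t in zip(ref, target))


def cosine_similarity(ref, target):
    """Cosine of the angle between two vectors (NaN if either has zero norm)."""
    dot = sum(r * t for r, t in zip(ref, target))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(t * t for t in target))
    return dot / norm if norm else float("nan")


def first_divergent_layer(ref_layers, target_layers, atol=1e-3):
    """Index of the first layer whose max absolute difference exceeds atol, else None."""
    for i, (ref, target) in enumerate(zip(ref_layers, target_layers)):
        if len(ref) != len(target):
            raise ValueError(f"layer {i}: shape mismatch")
        if max_abs_diff(ref, target) > atol:
            return i
    return None


if __name__ == "__main__":
    # Toy data: three "layers" of two values each (hypothetical, not real activations).
    ref_layers = [[1.0, 2.0], [0.5, -0.5], [3.0, 4.0]]
    target_layers = [[1.0, 2.0], [0.5, -0.5002], [3.0, 4.1]]
    for i, (r, t) in enumerate(zip(ref_layers, target_layers)):
        print(i, max_abs_diff(r, t), cosine_similarity(r, t))
    print("first divergent layer:", first_divergent_layer(ref_layers, target_layers))
```

With real activations, the same measures would normally be computed with an array library rather than Python lists.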

5. Paper & Landscape Surveys

In addition to the benchmarking infrastructure, this repository includes the data and scripts used for the literature and ecosystem surveys presented in the paper.

landscape_survey/: Contains the raw data of our broader LLM inference ecosystem landscape survey, classifying engines, routers, and managed platforms.
paper_survey/: Contains the full pipeline used to systematically scrape and analyze ML research papers. Key components include:
- scraper/: Scripts to pull metadata and PDFs from the ACL Anthology and OpenReview.
- filters/ & unify_papers.py: Utilities to merge datasets, handle symlinking of PDFs, and perform initial keyword filtering.
- classifier/: An automated, local LLM-as-a-judge pipeline (powered by vLLM) that extracts the reported code repositories, identifies inference engines used by authors, and classifies paper relevance based on detailed prompts.
- analyze_*.py: Scripts to process the classified data, extract GitHub repository statistics (e.g., verifying dependencies), normalize code domains, and compute overall engine usage statistics.
