This repository contains the code to reproduce the experiments evaluating and comparing various LLM inference engines (vLLM, LMDeploy, SGLang, llama.cpp, Ollama, and Hugging Face Transformers), as well as the scripts and data for our ecosystem surveys.
- Docker and Docker Compose.
- NVIDIA GPU with drivers supporting CUDA 12+ and the NVIDIA Container Toolkit.
- Hugging Face Token (for downloading restricted models/datasets).
- OpenAI API Key (for LLM-as-a-judge benchmarks like SimpleQA and JailbreakBench).
Create a .env file in the root directory (or export these variables to your shell):
HF_TOKEN=your_huggingface_token
OPENAI_API_KEY=your_openai_api_key

We provide a Dockerized setup script to automatically download the necessary Hugging Face datasets and model weights, and convert them to the required formats (like GGUF).
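Because the setup step needs HF_TOKEN to download restricted models, it can help to confirm that both variables are set before starting any container. The following is a minimal sketch and is not part of this repository: the helper names `parse_env_text` and `missing_vars` are hypothetical, and the parser only handles plain KEY=VALUE lines like the ones shown above.

```python
import os
from pathlib import Path

REQUIRED = ("HF_TOKEN", "OPENAI_API_KEY")
# Placeholder values copied from the example .env above; they count as "not set".
PLACEHOLDERS = {"your_huggingface_token", "your_openai_api_key"}


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_vars(file_values, environ=None):
    """Return required names that are unset, empty, or still placeholders."""
    environ = os.environ if environ is None else environ
    missing = []
    for name in REQUIRED:
        # An exported shell variable takes precedence over the .env file here.
        value = environ.get(name) or file_values.get(name, "")
        if not value or value in PLACEHOLDERS:
            missing.append(name)
    return missing


if __name__ == "__main__":
    env_path = Path(".env")
    file_values = parse_env_text(env_path.read_text()) if env_path.is_file() else {}
    missing = missing_vars(file_values)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Run it from the project root so that the relative `.env` path resolves.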
Run the setup container (this may take a while depending on your bandwidth):
docker compose up setup

(Optional) If you are running the Ollama experiments, you also need to initialize the Ollama models:
docker compose up prepare_ollama

The primary entry point for the benchmarks is main.py, which is executed inside the comparison_experiment container.
You can run an experiment by passing a model config and a benchmark config. For example, to evaluate Llama-3 8B on GSM8K using vLLM:
docker compose run --rm comparison_experiment python main.py \
--model_config configs/model/llama3_8B_vllm.yaml \
--benchmark_config configs/benchmark/gsm8k.yaml \
--seed 42 \
--output_dir ./results

Arguments:
- --model_config: Path to the engine/model YAML config (see configs/model/).
- --benchmark_config: Path to the benchmark YAML config (see configs/benchmark/).
- --seed: Random seed for reproducibility.
- --output_dir: Where to save the generated JSON results.
- --run_mode: full (generate and evaluate) or generate_only.
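To repeat a run over several seeds or configs, the command can be assembled programmatically from the arguments listed above. This is a hedged sketch rather than a script shipped with the repository: `build_command` and `build_sweep` are hypothetical helpers, the seed values are arbitrary, and only the flags and config paths already shown in this README are used.

```python
import shlex


def build_command(model_config, benchmark_config, seed,
                  output_dir="./results", run_mode=None):
    """Build the docker compose argv for one main.py run."""
    cmd = [
        "docker", "compose", "run", "--rm", "comparison_experiment",
        "python", "main.py",
        "--model_config", model_config,
        "--benchmark_config", benchmark_config,
        "--seed", str(seed),
        "--output_dir", output_dir,
    ]
    if run_mode is not None:
        if run_mode not in ("full", "generate_only"):
            raise ValueError(f"unknown run_mode: {run_mode!r}")
        cmd += ["--run_mode", run_mode]
    return cmd


def build_sweep(model_configs, benchmark_configs, seeds, **kwargs):
    """Return one command per (model config, benchmark config, seed) combination."""
    return [
        build_command(m, b, s, **kwargs)
        for m in model_configs
        for b in benchmark_configs
        for s in seeds
    ]


if __name__ == "__main__":
    commands = build_sweep(
        ["configs/model/llama3_8B_vllm.yaml"],
        ["configs/benchmark/gsm8k.yaml"],
        seeds=[42, 43, 44],  # arbitrary example seeds
    )
    for cmd in commands:
        # Print only; pass each list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Building the command as a list instead of a single string avoids quoting problems if a config path contains spaces.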
To run the internal layer divergence tracking experiment, you must first instrument the inference engines by cloning their source code and applying our tracking modifications.
Run the following from the root of this project to clone the specific engine versions:
# vLLM
git clone https://github.com/vllm-project/vllm.git ./vllm_source
cd vllm_source && git checkout v0.10.2 -b feat/layer-tracking && cd ..
# Transformers
git clone https://github.com/huggingface/transformers.git ./transformers_source
cd transformers_source && git checkout v4.57.0 -b feat/layer-tracking && cd ..
# SGLang
git clone https://github.com/sgl-project/sglang.git ./sglang_source
cd sglang_source && git checkout v0.5.2 -b feat/layer-tracking && cd ..
# LMDeploy
git clone https://github.com/InternLM/lmdeploy.git ./lmdeploy_source
cd lmdeploy_source && git checkout v0.10.1 -b feat/layer-tracking && cd ..
# llama-cpp-python
git clone https://github.com/abetlen/llama-cpp-python.git ./llama_cpp_python_source
cd llama_cpp_python_source && git checkout v0.3.16 -b feat/layer-tracking
git submodule update --init --recursive
cd ..
# Ollama
git clone https://github.com/ollama/ollama.git ./ollama_source
cd ollama_source && git checkout v0.13.5 -b feat/layer-tracking && cd ..

Copy the contents of the included modified_source directory into their respective cloned repositories. This will overwrite the original files with our instrumented versions that dump the hidden states.
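The copy step can be scripted. The sketch below rests on an assumption that this README does not confirm: that modified_source contains one subdirectory per engine, named after the corresponding clone directory (for example modified_source/vllm_source). Check the actual layout and adjust the list before relying on it; the `overlay` helper is hypothetical.

```python
import shutil
from pathlib import Path

# Clone directories created by the git commands above.
CLONE_DIRS = [
    "vllm_source",
    "transformers_source",
    "sglang_source",
    "lmdeploy_source",
    "llama_cpp_python_source",
    "ollama_source",
]


def overlay(modified_root, project_root, clone_dirs=CLONE_DIRS):
    """Copy modified_root/<name>/ over project_root/<name>/, overwriting files.

    Assumes (hypothetically) that each engine's modified files live in a
    subdirectory with the same name as its clone directory.
    Returns the names that were actually copied.
    """
    modified_root = Path(modified_root)
    project_root = Path(project_root)
    copied = []
    for name in clone_dirs:
        src = modified_root / name
        dst = project_root / name
        if not src.is_dir() or not dst.is_dir():
            continue  # skip engines that were not cloned or have no overlay
        # dirs_exist_ok=True (Python 3.8+) merges into the existing tree.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(name)
    return copied


if __name__ == "__main__":
    print("Overlaid:", overlay("modified_source", "."))
```

Skipping missing directories keeps the script usable when only a subset of the engines has been cloned.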
Because Ollama is distributed as a Go binary, we need to pre-build our modified version into a tarball so the Docker container can install it.
cd ollama_source
DOCKER_BUILDKIT=1 docker buildx build \
--platform linux/amd64 \
--target archive \
--output type=local,dest=custom-dist \
.
cd custom-dist
tar -czvf ../../custom-ollama-linux-amd64.tgz .
# Return to project root
cd ../../

Now, build the tracking Docker image (which will install the local, modified Python packages) and run the generation script:
# Build the tracking image
docker compose build layer_tracking
# Run the tracking script to generate intermediate tensors
docker compose up layer_tracking

Once the tensors are generated, you can compare the outputs of a target engine against the Hugging Face baseline locally:
python compare_engines.py \
--ref ./tensors/hf_llama_vllm_comp \
--target ./tensors/vllm_llama \
--target-engine vllm

In addition to the benchmarking infrastructure, this repository includes the data and scripts used for the literature and ecosystem surveys presented in the paper.
- landscape_survey/: Contains the raw data of our broader LLM inference ecosystem landscape survey, classifying engines, routers, and managed platforms.
- paper_survey/: Contains the full pipeline used to systematically scrape and analyze ML research papers. Key components include:
  - scraper/: Scripts to pull metadata and PDFs from the ACL Anthology and OpenReview.
  - filters/ & unify_papers.py: Utilities to merge datasets, handle symlinking of PDFs, and perform initial keyword filtering.
  - classifier/: An automated, local LLM-as-a-judge pipeline (powered by vLLM) that extracts the reported code repositories, identifies inference engines used by authors, and classifies paper relevance based on detailed prompts.
  - analyze_*.py: Scripts to process the classified data, extract GitHub repository statistics (e.g., verifying dependencies), normalize code domains, and compute overall engine usage statistics.
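As an illustration of the last step, engine usage statistics can be computed by counting each engine once per paper. This sketch is not the repository's analyze_*.py code: the record schema (a dict with an "engines" list) and the `engine_usage` helper are assumptions, and the sample records are made up for demonstration, not survey results.

```python
from collections import Counter


def engine_usage(records):
    """Count papers per engine and the share of all papers using it.

    Each record is assumed (hypothetically) to look like
    {"engines": ["vLLM", "Ollama"]}. Names are compared case-insensitively,
    and an engine is counted at most once per paper.
    """
    counts = Counter()
    for record in records:
        engines = {
            name.strip().lower()
            for name in record.get("engines", [])
            if name.strip()
        }
        counts.update(engines)
    total = len(records)
    return {
        engine: {"papers": n, "share": n / total if total else 0.0}
        for engine, n in counts.items()
    }


if __name__ == "__main__":
    # Made-up example records, not data from the survey.
    sample = [
        {"engines": ["vLLM", "vllm"]},
        {"engines": ["Ollama", "vLLM"]},
        {"engines": []},
    ]
    for engine, stats in sorted(engine_usage(sample).items()):
        print(f"{engine}: {stats['papers']} papers ({stats['share']:.0%})")
```

Deduplicating within a paper keeps a repeated mention from inflating an engine's count.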