Dormant-Neurons/inference_backends

LLM Inference Engine Benchmarking

This repository contains the code to reproduce the experiments evaluating and comparing various LLM inference engines (vLLM, LMDeploy, SGLang, llama.cpp, Ollama, and Hugging Face Transformers), as well as the scripts and data for our ecosystem surveys.

Prerequisites

  • Docker and Docker Compose
  • NVIDIA GPU with drivers supporting CUDA 12+, plus the NVIDIA Container Toolkit
  • Hugging Face token (for downloading gated models and datasets)
  • OpenAI API key (for LLM-as-a-judge benchmarks such as SimpleQA and JailbreakBench)

1. Environment Setup

Create a .env file in the project root (or export these variables in your shell):

HF_TOKEN=your_huggingface_token
OPENAI_API_KEY=your_openai_api_key
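
Before starting the longer download steps, it can help to confirm that both variables are actually set. The following is a minimal sketch, not part of this repository: the helper names parse_env_text and missing_keys are hypothetical, and the parser only handles plain KEY=value lines.

```python
from pathlib import Path

# The two variables named in this README.
REQUIRED_KEYS = ("HF_TOKEN", "OPENAI_API_KEY")


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    missing = missing_keys(values)
    print("missing:", ", ".join(missing) if missing else "none")
```

The check only confirms that the keys are present and non-empty; it does not verify that the tokens are valid.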

2. Download Models & Datasets

We provide a Dockerized setup script to automatically download the necessary Hugging Face datasets and model weights, and convert them to the required formats (like GGUF).

Run the setup container (this may take a while depending on your bandwidth):

docker compose up setup

(Optional) If you are running the Ollama experiments, you also need to initialize the Ollama models:

docker compose up prepare_ollama

3. Running the Main Experiments

The primary entry point for the benchmarks is main.py, which is executed inside the comparison_experiment container.

You can run an experiment by passing a model config and a benchmark config. For example, to evaluate Llama-3 8B on GSM8K using vLLM:

docker compose run --rm comparison_experiment python main.py \
    --model_config configs/model/llama3_8B_vllm.yaml \
    --benchmark_config configs/benchmark/gsm8k.yaml \
    --seed 42 \
    --output_dir ./results

Arguments:

  • --model_config: Path to the engine/model YAML config (see configs/model/).
  • --benchmark_config: Path to the benchmark YAML config (see configs/benchmark/).
  • --seed: Random seed for reproducibility.
  • --output_dir: Where to save the generated JSON results.
  • --run_mode: full (generate and evaluate) or generate_only.
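
To run several configurations in a row, the command lines can be generated programmatically. The sketch below only assembles and prints commands using the flags listed above; build_command is a hypothetical helper, and the second seed is an illustrative assumption rather than a value used in the paper.

```python
import itertools
import shlex


def build_command(model_config, benchmark_config, seed,
                  output_dir="./results", run_mode=None):
    """Assemble the docker compose invocation for one experiment."""
    cmd = [
        "docker", "compose", "run", "--rm", "comparison_experiment",
        "python", "main.py",
        "--model_config", model_config,
        "--benchmark_config", benchmark_config,
        "--seed", str(seed),
        "--output_dir", output_dir,
    ]
    if run_mode is not None:
        # Only pass --run_mode when explicitly requested.
        cmd += ["--run_mode", run_mode]
    return cmd


# Config paths taken from the example above; extend these lists as needed.
model_configs = ["configs/model/llama3_8B_vllm.yaml"]
benchmark_configs = ["configs/benchmark/gsm8k.yaml"]
seeds = [42, 43]  # 43 is a hypothetical extra seed

for model, bench, seed in itertools.product(model_configs, benchmark_configs, seeds):
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(build_command(model, bench, seed)))
```

Printing rather than executing keeps the sketch side-effect free; to launch the runs directly, the list returned by build_command could be passed to subprocess.run.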

4. Layer Tracking & Engine Comparison (Optional)

To run the internal layer divergence tracking experiment, you must first instrument the inference engines by cloning their source code and applying our tracking modifications.

Step 4.1: Clone the Engine Repositories

Run the following from the root of this project to clone the specific engine versions:

# vLLM
git clone https://github.com/vllm-project/vllm.git ./vllm_source
cd vllm_source && git checkout v0.10.2 -b feat/layer-tracking && cd ..

# Transformers
git clone https://github.com/huggingface/transformers.git ./transformers_source
cd transformers_source && git checkout v4.57.0 -b feat/layer-tracking && cd ..

# SGLang
git clone https://github.com/sgl-project/sglang.git ./sglang_source
cd sglang_source && git checkout v0.5.2 -b feat/layer-tracking && cd ..

# LMDeploy
git clone https://github.com/InternLM/lmdeploy.git ./lmdeploy_source
cd lmdeploy_source && git checkout v0.10.1 -b feat/layer-tracking && cd ..

# llama-cpp-python
git clone https://github.com/abetlen/llama-cpp-python.git ./llama_cpp_python_source
cd llama_cpp_python_source && git checkout v0.3.16 -b feat/layer-tracking
git submodule update --init --recursive
cd ..

# Ollama
git clone https://github.com/ollama/ollama.git ./ollama_source
cd ollama_source && git checkout v0.13.5 -b feat/layer-tracking && cd ..

Step 4.2: Apply Modified Source Files

Copy each engine's files from the included modified_source directory into the corresponding cloned repository. This overwrites the original files with our instrumented versions, which dump the hidden states.
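
The copy step can be scripted as an overlay that overwrites existing files and leaves everything else in the clone untouched. This is a sketch under stated assumptions: overlay_tree is a hypothetical helper, and the directory pairing shown in the usage comment is a guess at how modified_source is organized, so check the actual layout before running it.

```python
import shutil
from pathlib import Path


def overlay_tree(src: Path, dst: Path) -> list:
    """Copy every file under src onto dst, overwriting files that already exist.

    Returns the relative paths of the files that were copied.
    """
    if not src.is_dir():
        raise FileNotFoundError(f"missing source directory: {src}")
    if not dst.is_dir():
        raise FileNotFoundError(f"missing cloned repository: {dst}")
    # dirs_exist_ok=True (Python 3.8+) merges into the existing tree.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sorted(
        p.relative_to(src).as_posix() for p in src.rglob("*") if p.is_file()
    )


# Usage (ASSUMPTION: one subdirectory per engine inside modified_source):
#   overlay_tree(Path("modified_source/vllm"), Path("vllm_source"))
```

Running git status inside each cloned repository afterwards shows exactly which files the overlay changed.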

Step 4.3: Build Custom Ollama Archive

Because Ollama is distributed as a Go binary, we need to pre-build our modified version into a tarball so the Docker container can install it.

cd ollama_source

DOCKER_BUILDKIT=1 docker buildx build \
  --platform linux/amd64 \
  --target archive \
  --output type=local,dest=custom-dist \
  .

cd custom-dist
tar -czvf ../../custom-ollama-linux-amd64.tgz .

# Return to project root
cd ../../

Step 4.4: Rebuild and Run the Tracking Container

Now build the tracking Docker image (which installs the local, modified Python packages) and run the generation script:

# Build the tracking image
docker compose build layer_tracking

# Run the tracking script to generate intermediate tensors
docker compose up layer_tracking

Step 4.5: Compare Engines

Once the tensors are generated, you can compare the outputs of a target engine against the Hugging Face baseline locally:

python compare_engines.py \
    --ref ./tensors/hf_llama_vllm_comp \
    --target ./tensors/vllm_llama \
    --target-engine vllm
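
For intuition about what such a comparison measures, the sketch below computes two common divergence measures between a reference and a target activation vector. It is an illustration only and does not reproduce what compare_engines.py computes: the function names, the choice of metrics, and the sample values are all assumptions.

```python
import math


def _check_lengths(ref, target):
    if len(ref) != len(target):
        raise ValueError("ref and target must have the same length")


def max_abs_diff(ref, target) -> float:
    """Largest element-wise absolute difference between two vectors."""
    _check_lengths(ref, target)
    return max((abs(r - t) for r, t in zip(ref, target)), default=0.0)


def cosine_similarity(ref, target) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    _check_lengths(ref, target)
    dot = sum(r * t for r, t in zip(ref, target))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Made-up hidden-state values for a single layer (illustration only).
ref_layer = [0.5, -1.0, 2.0]
target_layer = [0.5, -1.0, 2.25]

print("max abs diff:", max_abs_diff(ref_layer, target_layer))
print("cosine similarity:", cosine_similarity(ref_layer, target_layer))
```

Maximum absolute difference highlights isolated outliers, while cosine similarity summarizes overall directional agreement; tracking both per layer shows where two engines start to drift apart.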

5. Paper & Landscape Surveys

In addition to the benchmarking infrastructure, this repository includes the data and scripts used for the literature and ecosystem surveys presented in the paper.

  • landscape_survey/: Contains the raw data of our broader LLM inference ecosystem landscape survey, classifying engines, routers, and managed platforms.
  • paper_survey/: Contains the full pipeline used to systematically scrape and analyze ML research papers. Key components include:
    • scraper/: Scripts to pull metadata and PDFs from the ACL Anthology and OpenReview.
    • filters/ & unify_papers.py: Utilities to merge datasets, handle symlinking of PDFs, and perform initial keyword filtering.
    • classifier/: An automated, local LLM-as-a-judge pipeline (powered by vLLM) that extracts the reported code repositories, identifies inference engines used by authors, and classifies paper relevance based on detailed prompts.
    • analyze_*.py: Scripts to process the classified data, extract GitHub repository statistics (e.g., verifying dependencies), normalize code domains, and compute overall engine usage statistics.
