Skip to content

namdavid2904/LLMServingSim-HoL-Opt

 
 

Repository files navigation

LLMServingSim-HoL-Opt

Research Focus: Mitigating Head-of-Line (HoL) Blocking in Disaggregated Serving

This repository is a research fork of LLMServingSim 2.0. While disaggregated prefill/decode (P/D) architectures successfully isolate interference between phases, our research identifies that intra-phase Head-of-Line (HoL) blocking within the prefill stage remains a critical bottleneck for short-context requests.

Our Goal: To implement and evaluate a new scheduling algorithm designed to optimize the Wait-to-Compute (Slowdown) ratio, ensuring fair resource allocation across bimodal workloads (interleaved long and short prompts).


Key Research Contributions (This Fork)

  • Custom Scheduler Implementation: Located in inference_serving/scheduler.py (see branch research/scheduler-impl), introducing [Algorithm Name] to mitigate prefill-stage HoL blocking.
  • Bimodal Workload Traces: New datasets in dataset/research_hol_blocking/ specifically designed to trigger HoL scenarios by interleaving massive context prompts with short-latency requests.
  • Slowdown Analysis Suite: Extended evaluation scripts to calculate and visualize the Wait-to-Compute ratio and P99 tail latency for short-context victims.

Evaluation Metrics

To prove the efficacy of our algorithm, we focus on the following primary metrics:

  1. Wait-to-Compute Ratio ($Slowdown$): $\frac{Wait_{Time} + Compute_{Time}}{Compute_{Time}}$
  2. Tail Latency (p99 TTFT): Measuring the impact on the most delayed requests.
  3. Jain’s Fairness Index: Mathematically proving the reduction in scheduling unfairness.

Build LLMServingSim

1. Git clone

git clone --recurse-submodules https://github.com/namdavid2904/LLMServingSim-HoL-Opt.git
cd LLMServingSim-HoL-Opt

2. Run Docker

This will configure and run the Docker environment. See docker.sh for details.

./docker.sh

3. Build ASTRA-Sim and Chakra

This will compile ASTRA-Sim (analytical backend) and install Chakra. See compile.sh for details.

./compile.sh

Run LLMServingSim

1. Set input configurations

All configurations for LLMServingSim are generated automatically by inference_serving/config_builder.py from a cluster_config file.

The cluster_config file specifies node topology, instance layout, hardware type, memory hierarchy, and interconnect parameters. It also supports per-layer placement rules for weights, KV cache, and experts, as well as PIM-enabled device configuration.

Config paths:

  • Cluster config: cluster_config/{config_name}.json
  • Logical topology config (ns3 backend only): astra-sim/inputs/logical_topology/{topology_name}.json

Dataset path:

  • Dataset: dataset/{dataset_name}.jsonl
  • Runtime-generated traces: astra-sim/inputs/trace/

See cluster_config/ for example configurations and cluster_config/README.md for the configuration format reference.

2. Run LLMServingSim

Test run:

python main.py \
    --cluster-config 'cluster_config/single_node_single_instance.json' \
    --fp 16 --block-size 16 \
    --dataset 'dataset/sharegpt_req100_rate10_llama.jsonl' \
    --output 'output/example_single_run.csv' \
    --num-req 100 --log-interval 1.0

See run.sh for additional examples covering multi-instance, P/D disaggregation, MoE, prefix caching, CXL memory, PIM, power modeling, and sub-batch interleaving:

./run.sh

Parameters of main.py

The current version supports the following models and hardware:

Models: meta-llama/Llama-3.1-8B, meta-llama/Llama-3.1-70B, microsoft/Phi-mini-MoE-instruct, mistralai/Mixtral-8x7B-v0.1

Hardware: A6000, H100, TPU-v6e-1

New models and hardware can be added using the provided profiler. See Adding a New Model & Hardware.

Parameter Default Description
--cluster-config single_node_single_instance.json Node- and instance-level configuration
--max-batch 0 Maximum batch size; 0 means no limit
--max-num-batched-tokens 2048 Maximum tokens processed per iteration
--fp 16 Floating-point precision in bits
--request-routing-policy RR Request routing across instances (RR, RAND, CUSTOM)
--expert-routing-policy FAST Expert token routing for MoE (RR, RAND, FAST, CUSTOM)
--enable-prefix-caching False Enable prefix caching via RadixAttention
--enable-prefix-sharing False Enable second-tier prefix cache pooling
--prefix-storage None Storage tier for the second-tier prefix pool (None, CPU, CXL)
--enable-local-offloading False Enable weight offloading to local memory
--enable-attn-offloading False Enable attention computation offloading to PIM
--enable-sub-batch-interleaving False Enable sub-batch interleaving for XPU/PIM overlap
--enable-attn-prediction False Enable real-time attention latency prediction
--prioritize-prefill False Prioritize prefill requests in scheduling
--block-size 16 KV cache block size in tokens
--dataset None Path to .jsonl dataset; if None, add requests manually in main.py
--output None Path for per-request CSV output; if None, stdout only
--gen True Set to False to skip the initiation (prefill) phase
--num-req 100 Number of requests to simulate
--log-interval 0.5 Throughput logging interval in seconds
--log-level WARNING Logging verbosity (WARNING, INFO, DEBUG)
--network-backend analytical Network simulation backend (analytical, ns3)

Outputs of main.py

1. Standard output

The simulator reports runtime information through a configurable logger. It logs which requests are processed at each iteration and periodically reports throughput, memory usage, and power consumption.

Adjusting --log-level to INFO or DEBUG enables more detailed output, including per-layer memory load and store activity.

2. Output file

{output_path}.csv contains per-request latency metrics. An example is provided at output/example_run.csv.

Adding a New Model & Hardware

1. Build a performance model

LLMServingSim uses the PyTorch-based profiler in llm_profile/ to generate per-layer latency, attention latency, and power models for a given hardware target. Once profiling is complete, create a cluster config referencing the new hardware name and run main.py as usual.

See llm_profile/README.md for full profiling instructions.

2. Modify simulator functions (optional)

The current version supports Llama-based model architectures. Models that deviate from this architecture may require modifications to the following:

inference_serving/memory_model.py — functions calculate_sizes and get_weight

calculate_sizes computes input, weight, and output tensor sizes for each layer type. get_weight aggregates total model size from calculate_sizes. Modify these according to the target model architecture.

inference_serving/trace_generator.py — function synthesize_trace

This function constructs the per-iteration execution trace by stacking layers according to the model architecture. When modifying it, ensure:

  • The ATTENTION layer is correctly separated per request
  • The output size of layer i matches the input size of layer i+1
  • ALLREDUCE operations are correctly placed for tensor-parallel synchronization

Evaluation

The evaluation/ directory contains the artifact evaluation flow for Figures 5 to 10 from the paper. It includes figure-specific shell scripts, plotting code, parsers, processed reference inputs, and preserved example outputs under evaluation/artifacts/.

Before running artifact evaluation, complete the setup steps above (./docker.sh and ./compile.sh) and run the evaluation commands inside that environment.

Enter evaluation/ first:

cd evaluation

Run an individual figure:

bash figure_5.sh
bash figure_6.sh
bash figure_7.sh
bash figure_8.sh
bash figure_9.sh
bash figure_10.sh

To reproduce the full evaluation set in one pass:

bash run_all.sh

To compare generated parsed outputs against preserved artifact snapshots:

# Compare all figures (5-10)
bash compare.sh
# Compare one figure
bash compare.sh 5
# Compare multiple selected figures
bash compare.sh 5 7 9
# Equivalent single-figure form
bash compare.sh figure_5

For visual validation, compare generated PDFs with the corresponding *_ref.pdf files in each figure folder.

See evaluation/README.md for detailed folder structure, reference-comparison guidance, and per-figure notes.

Publications

ISPASS 2026
LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
Jaehong Cho*, Hyunmin Choi*, Guseul Heo, Jongse Park (KAIST) [Paper] (To Appear)
*Equal contribution
DOI

CAL 2025
LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
Jaehong Cho, Hyunmin Choi, Jongse Park (KAIST) [Paper]

IISWC 2024
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park (KAIST) [Paper]
DOI

Citation

If you use this fork for HoL blocking research, please cite the original paper and this repository:

@ARTICLE{11224567,
    author={Cho, Jaehong and Choi, Hyunmin and Park, Jongse},
    journal={IEEE Computer Architecture Letters},
    title={{LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving
            Techniques in LLM Infrastructure}},
    year={2025},
    volume={24},
    number={02},
    pages={361-364},
    doi={10.1109/LCA.2025.3628325},
    ISSN={1556-6064},
    publisher={IEEE Computer Society},
    address={Los Alamitos, CA, USA},
    month=jul
}

@INPROCEEDINGS{10763697,
    author={Cho, Jaehong and Kim, Minsu and Choi, Hyunmin and Heo, Guseul and Park, Jongse},
    booktitle={2024 IEEE International Symposium on Workload Characterization (IISWC)},
    title={{LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving
            at Scale}},
    year={2024},
    pages={15-29},
    doi={10.1109/IISWC63097.2024.00012}
}

About

Research fork of LLMServingSim 2.0 investigating Head-of-Line (HoL) blocking mitigation in disaggregated LLM serving via novel scheduling algorithms.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 81.9%
  • Jupyter Notebook 12.8%
  • Shell 5.3%