LLMServingSim-HoL-Opt

Research Focus: Mitigating Head-of-Line (HoL) Blocking in Disaggregated Serving

This repository is a research fork of LLMServingSim 2.0. Disaggregated prefill/decode (P/D) architectures isolate the two phases and so remove inter-phase interference. Our research finds, however, that intra-phase Head-of-Line (HoL) blocking within the prefill stage remains a critical bottleneck for short-context requests: a short prompt queued behind a long one must wait for the entire long prefill before it can start.

Our Goal: Implement and evaluate a scheduling algorithm that minimizes the Wait-to-Compute (Slowdown) ratio, so that resources are allocated fairly across bimodal workloads (interleaved long and short prompts).
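As a minimal illustration of the effect (a toy model, not the simulator's scheduler), the sketch below serves prefill requests one at a time in first-come-first-served order, assuming every request arrives at time zero. The function name `fcfs_slowdowns` and the prefill times are hypothetical values chosen for illustration only.

```python
def fcfs_slowdowns(prefill_times):
    """Return per-request slowdown, (wait + compute) / compute, under FCFS.

    Assumption: all requests arrive at t = 0 and are served one at a time
    in list order.
    """
    clock = 0.0
    slowdowns = []
    for compute in prefill_times:
        wait = clock  # time spent queued behind earlier requests
        slowdowns.append((wait + compute) / compute)
        clock += compute
    return slowdowns


# One long prompt (2.0 s prefill) ahead of three short ones (0.05 s each).
long_first = fcfs_slowdowns([2.0, 0.05, 0.05, 0.05])
# The same four requests with the short prompts served first.
short_first = fcfs_slowdowns([0.05, 0.05, 0.05, 2.0])

print(long_first)
print(short_first)
```

With the long prompt at the head of the queue, each short request sees a slowdown of roughly 41 to 43. With the short prompts first, their slowdowns stay between about 1 and 3, while the long request's slowdown rises only to about 1.075.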

Key Research Contributions (This Fork)

Custom Scheduler Implementation: Located in inference_serving/scheduler.py (see branch research/scheduler-impl), introducing [Algorithm Name] to mitigate prefill-stage HoL blocking.
Bimodal Workload Traces: New datasets in dataset/research_hol_blocking/ designed to trigger HoL scenarios by interleaving very long-context prompts with short, latency-sensitive requests.
Slowdown Analysis Suite: Extended evaluation scripts to calculate and visualize the Wait-to-Compute ratio and P99 tail latency for short-context victims.
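The idea behind a bimodal trace can be sketched as follows. This is an assumption-laden illustration, not the generator used for dataset/research_hol_blocking/: the function `make_bimodal_trace`, its parameters, and the field names `arrival_time_s`, `input_tokens`, and `output_tokens` are hypothetical and may not match the simulator's `.jsonl` schema. Check an existing file under dataset/ for the real format.

```python
import json
import random


def make_bimodal_trace(num_req, long_every=10, short_len=128,
                       long_len=32768, rate_per_s=10.0, seed=0):
    """Build a list of request records with interleaved long and short prompts.

    Every `long_every`-th request is a long prompt; the rest are short.
    Inter-arrival times are drawn from an exponential distribution.
    """
    rng = random.Random(seed)
    t = 0.0
    records = []
    for i in range(num_req):
        t += rng.expovariate(rate_per_s)  # mean gap of 1 / rate_per_s seconds
        is_long = (i % long_every == 0)
        records.append({
            "arrival_time_s": round(t, 6),    # hypothetical field name
            "input_tokens": long_len if is_long else short_len,
            "output_tokens": 64,              # fixed for simplicity
        })
    return records


trace = make_bimodal_trace(20)
# One JSON object per line, as in a .jsonl file.
lines = [json.dumps(r) for r in trace]
```

Placing a long prompt at a fixed interval makes the HoL pattern deterministic and easy to inspect. A randomized mix would be closer to production traffic but harder to reason about.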

Evaluation Metrics

To evaluate the algorithm, we focus on the following primary metrics:

Wait-to-Compute Ratio (Slowdown): $\text{Slowdown} = \frac{T_{\text{wait}} + T_{\text{compute}}}{T_{\text{compute}}}$. A request that never waits has a slowdown of 1.
Tail Latency (p99 TTFT): The 99th-percentile time to first token, which captures the impact on the most delayed requests.
Jain’s Fairness Index: Quantifies how evenly service is distributed across requests, and therefore any reduction in scheduling unfairness.
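The three metrics can be computed along the following lines. This is a sketch under stated assumptions, not the repository's analysis suite: the function names are hypothetical, the percentile uses the nearest-rank definition, and applying Jain's index to per-request slowdowns is one possible choice that the text above does not fix.

```python
import math


def slowdown(wait_s, compute_s):
    """(wait + compute) / compute; equals 1.0 when a request never waits."""
    return (wait_s + compute_s) / compute_s


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Ranges from 1/n (one value dominates) to 1.0 (all values equal).
    Assumes a non-empty list with at least one non-zero value.
    """
    total = sum(values)
    return total * total / (len(values) * sum(v * v for v in values))


# Hypothetical per-request measurements: (wait, compute) in seconds.
requests = [(0.0, 2.0), (2.0, 0.05), (2.05, 0.05), (2.1, 0.05)]
slowdowns = [slowdown(w, c) for w, c in requests]
p99_slowdown = percentile(slowdowns, 99)
fairness = jain_index(slowdowns)
```

The nearest-rank percentile always returns an observed value, which keeps small-sample tail figures easy to interpret. An interpolating percentile would be an equally valid choice.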

Build LLMServingSim

1. Git clone

git clone --recurse-submodules https://github.com/namdavid2904/LLMServingSim-HoL-Opt.git
cd LLMServingSim-HoL-Opt

2. Run Docker

This will configure and run the Docker environment. See docker.sh for details.

./docker.sh

3. Build ASTRA-Sim and Chakra

This will compile ASTRA-Sim (analytical backend) and install Chakra. See compile.sh for details.

./compile.sh

Run LLMServingSim

1. Set input configurations

All configurations for LLMServingSim are generated automatically by inference_serving/config_builder.py from a cluster_config file.

The cluster_config file specifies node topology, instance layout, hardware type, memory hierarchy, and interconnect parameters. It also supports per-layer placement rules for weights, KV cache, and experts, as well as PIM-enabled device configuration.

Config paths:

Cluster config: cluster_config/{config_name}.json
Logical topology config (ns3 backend only): astra-sim/inputs/logical_topology/{topology_name}.json

Dataset path:

Dataset: dataset/{dataset_name}.jsonl
Runtime-generated traces: astra-sim/inputs/trace/

See cluster_config/ for example configurations and cluster_config/README.md for the configuration format reference.

2. Run LLMServingSim

Test run:

python main.py \
    --cluster-config 'cluster_config/single_node_single_instance.json' \
    --fp 16 --block-size 16 \
    --dataset 'dataset/sharegpt_req100_rate10_llama.jsonl' \
    --output 'output/example_single_run.csv' \
    --num-req 100 --log-interval 1.0

See run.sh for additional examples covering multi-instance, P/D disaggregation, MoE, prefix caching, CXL memory, PIM, power modeling, and sub-batch interleaving:

./run.sh

Parameters of `main.py`

The current version supports the following models and hardware:

Models: meta-llama/Llama-3.1-8B, meta-llama/Llama-3.1-70B, microsoft/Phi-mini-MoE-instruct, mistralai/Mixtral-8x7B-v0.1

Hardware: A6000, H100, TPU-v6e-1

New models and hardware can be added using the provided profiler. See Adding a New Model & Hardware.

| Parameter | Default | Description |
| --- | --- | --- |
| `--cluster-config` | `single_node_single_instance.json` | Node- and instance-level configuration |
| `--max-batch` | `0` | Maximum batch size; `0` means no limit |
| `--max-num-batched-tokens` | `2048` | Maximum tokens processed per iteration |
| `--fp` | `16` | Floating-point precision in bits |
| `--request-routing-policy` | `RR` | Request routing across instances (`RR`, `RAND`, `CUSTOM`) |
| `--expert-routing-policy` | `FAST` | Expert token routing for MoE (`RR`, `RAND`, `FAST`, `CUSTOM`) |
| `--enable-prefix-caching` | `False` | Enable prefix caching via RadixAttention |
| `--enable-prefix-sharing` | `False` | Enable second-tier prefix cache pooling |
| `--prefix-storage` | `None` | Storage tier for the second-tier prefix pool (`None`, `CPU`, `CXL`) |
| `--enable-local-offloading` | `False` | Enable weight offloading to local memory |
| `--enable-attn-offloading` | `False` | Enable attention computation offloading to PIM |
| `--enable-sub-batch-interleaving` | `False` | Enable sub-batch interleaving for XPU/PIM overlap |
| `--enable-attn-prediction` | `False` | Enable real-time attention latency prediction |
| `--prioritize-prefill` | `False` | Prioritize prefill requests in scheduling |
| `--block-size` | `16` | KV cache block size in tokens |
| `--dataset` | `None` | Path to `.jsonl` dataset; if `None`, add requests manually in `main.py` |
| `--output` | `None` | Path for per-request CSV output; if `None`, stdout only |
| `--gen` | `True` | Set to `False` to skip the initiation (prefill) phase |
| `--num-req` | `100` | Number of requests to simulate |
| `--log-interval` | `0.5` | Throughput logging interval in seconds |
| `--log-level` | `WARNING` | Logging verbosity (`WARNING`, `INFO`, `DEBUG`) |
| `--network-backend` | `analytical` | Network simulation backend (`analytical`, `ns3`) |

Outputs of `main.py`

1. Standard output

The simulator reports runtime information through a configurable logger. It logs which requests are processed at each iteration and periodically reports throughput, memory usage, and power consumption.

Adjusting --log-level to INFO or DEBUG enables more detailed output, including per-layer memory load and store activity.

2. Output file

{output_path}.csv contains per-request latency metrics. An example is provided at output/example_run.csv.

Adding a New Model & Hardware

1. Build a performance model

LLMServingSim uses the PyTorch-based profiler in llm_profile/ to generate per-layer latency, attention latency, and power models for a given hardware target. Once profiling is complete, create a cluster config referencing the new hardware name and run main.py as usual.

See llm_profile/README.md for full profiling instructions.

2. Modify simulator functions (optional)

The current version supports Llama-based model architectures. Models that deviate from this architecture may require modifications to the following:

inference_serving/memory_model.py — functions calculate_sizes and get_weight

calculate_sizes computes input, weight, and output tensor sizes for each layer type. get_weight aggregates total model size from calculate_sizes. Modify these according to the target model architecture.
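To make the size accounting concrete, here is a minimal sketch for a single linear layer. It is not the repository's `calculate_sizes` or `get_weight`: the helper `linear_layer_sizes`, its signature, and the layer dimensions are assumptions made for illustration.

```python
def linear_layer_sizes(num_tokens, in_features, out_features, fp_bits=16):
    """Return input, weight, and output tensor sizes in bytes for y = x @ W.

    Hypothetical helper: x is [num_tokens, in_features] and
    W is [in_features, out_features], with no bias term.
    """
    bytes_per_elem = fp_bits // 8
    return {
        "input": num_tokens * in_features * bytes_per_elem,
        "weight": in_features * out_features * bytes_per_elem,
        "output": num_tokens * out_features * bytes_per_elem,
    }


# Two illustrative layers applied to 16 tokens at 16-bit precision.
layers = [
    linear_layer_sizes(16, 4096, 4096),
    linear_layer_sizes(16, 4096, 14336),
]

# Aggregating the weight entries gives the total weight footprint,
# which is the role the text above attributes to get_weight.
total_weight_bytes = sum(layer["weight"] for layer in layers)
```

A model with a different layer structure would need its own per-layer shape rules, but the byte accounting (element count times bytes per element) stays the same.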

inference_serving/trace_generator.py — function synthesize_trace

This function constructs the per-iteration execution trace by stacking layers according to the model architecture. When modifying it, ensure:

The ATTENTION layer is correctly separated per request
The output size of layer i matches the input size of layer i+1
ALLREDUCE operations are correctly placed for tensor-parallel synchronization
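The size-matching requirement in the list above can be checked mechanically. The sketch below assumes each layer is represented as a dictionary with hypothetical `name`, `input_size`, and `output_size` keys; the actual trace format produced by `synthesize_trace` may differ.

```python
def find_size_mismatches(layers):
    """Return indices i where layer i's output size differs from
    layer i+1's input size. An empty list means the chain is consistent.
    """
    return [
        i for i in range(len(layers) - 1)
        if layers[i]["output_size"] != layers[i + 1]["input_size"]
    ]


# Hypothetical three-layer chain with one deliberate mismatch.
chain = [
    {"name": "qkv_proj", "input_size": 131072, "output_size": 196608},
    {"name": "attention", "input_size": 196608, "output_size": 131072},
    {"name": "o_proj", "input_size": 65536, "output_size": 131072},
]

mismatches = find_size_mismatches(chain)
for i in mismatches:
    print(f"{chain[i]['name']} -> {chain[i + 1]['name']}: size mismatch")
```

Running a check like this after modifying the layer stacking catches shape errors before the trace reaches the simulation backend.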

Evaluation

The evaluation/ directory contains the artifact evaluation flow for Figures 5 to 10 from the paper. It includes figure-specific shell scripts, plotting code, parsers, processed reference inputs, and preserved example outputs under evaluation/artifacts/.

Before running artifact evaluation, complete the setup steps above (./docker.sh and ./compile.sh) and run the evaluation commands inside that environment.

Enter evaluation/ first:

cd evaluation

Run an individual figure:

bash figure_5.sh
bash figure_6.sh
bash figure_7.sh
bash figure_8.sh
bash figure_9.sh
bash figure_10.sh

To reproduce the full evaluation set in one pass:

bash run_all.sh

To compare generated parsed outputs against preserved artifact snapshots:

# Compare all figures (5-10)
bash compare.sh
# Compare one figure
bash compare.sh 5
# Compare multiple selected figures
bash compare.sh 5 7 9
# Equivalent single-figure form
bash compare.sh figure_5

For visual validation, compare generated PDFs with the corresponding *_ref.pdf files in each figure folder.

See evaluation/README.md for detailed folder structure, reference-comparison guidance, and per-figure notes.

Publications

ISPASS 2026
LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
Jaehong Cho^*, Hyunmin Choi^*, Guseul Heo, Jongse Park (KAIST) [Paper] (To Appear)
^*Equal contribution

CAL 2025
LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
Jaehong Cho, Hyunmin Choi, Jongse Park (KAIST) [Paper]

IISWC 2024
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park (KAIST) [Paper]

Citation

If you use this fork for HoL blocking research, please cite the original LLMServingSim papers below, along with this repository:

@ARTICLE{11224567,
    author={Cho, Jaehong and Choi, Hyunmin and Park, Jongse},
    journal={IEEE Computer Architecture Letters},
    title={{LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving
            Techniques in LLM Infrastructure}},
    year={2025},
    volume={24},
    number={02},
    pages={361-364},
    doi={10.1109/LCA.2025.3628325},
    ISSN={1556-6064},
    publisher={IEEE Computer Society},
    address={Los Alamitos, CA, USA},
    month=jul
}

@INPROCEEDINGS{10763697,
    author={Cho, Jaehong and Kim, Minsu and Choi, Hyunmin and Heo, Guseul and Park, Jongse},
    booktitle={2024 IEEE International Symposium on Workload Characterization (IISWC)},
    title={{LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving
            at Scale}},
    year={2024},
    pages={15-29},
    doi={10.1109/IISWC63097.2024.00012}
}
