CCL-Bench is a trace-based benchmark for LLM infrastructure. Each benchmark row is backed by workload metadata and profiler artifacts, so results can be recomputed, audited, and extended as new models, frameworks, hardware, and collective communication libraries are added.
The project is organized around three layers:
- Evidence: workload cards, run metadata, and external profiler traces.
- Analysis: metric tools that consume trace directories and return leaderboard values.
- Presentation: a static website generated from configured trace and metric pairs.
Raw traces are not included. The repository keeps lightweight metadata, scripts, metric code, and generated website data. However, we provide a sample trace for testing purposes under llama3-torchtitan-nccl-4gpu-fsdp_2-tp_2-b_4-s_512/.
| Path | Purpose |
|---|---|
| `workload_card_template.yaml` | Workload card template for benchmark rows. |
| `trace_collection/` | Lightweight workload cards and run scripts. |
| `trace_gen/` | Guidance and helpers for collecting profiler traces. |
| `tools/` | Metric toolkit. Each metric is implemented as an importable tool. |
| `website/` | Static leaderboard and generated benchmark data. |
| `workload_suite/` | Standard workload definitions used to compare software and hardware. |
| `scripts/` | Reproducibility and collection scripts for specific systems or experiments. |
| `agent/` | Experimental/private config tuning agents. |
| `simulation/` | Experimental/private trace-based simulation utilities. |
No GPUs are needed to use the toolkit or to test the simulation pipeline.
Create a local environment:
```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run one metric on a trace directory:

```
python tools/main.py --trace /path/to/trace_dir --metric avg_step_time

# Example
python tools/main.py --trace llama3-torchtitan-nccl-4gpu-fsdp_2-tp_2-b_4-s_512/ --metric avg_step_time
```

Now test the simulation pipeline.
Build the AstraSim Docker image (required for the simulation pipeline):
```
docker build -t astra-sim:latest .
```

The Docker build takes 20–40 minutes and produces a ~14 GB image.
Run a what-if simulation on the sample trace (requires the AstraSim Docker image):
```
# Baseline
python simulation/pipeline.py --mode comm-only \
    --trace-dir llama3-torchtitan-nccl-4gpu-fsdp_2-tp_2-b_4-s_512

# What-if: 2× intra-node bandwidth
python simulation/pipeline.py --mode comm-only \
    --trace-dir llama3-torchtitan-nccl-4gpu-fsdp_2-tp_2-b_4-s_512 \
    --intra-bandwidth 600
```

You can view the results we computed over the collected traces by running:
```
python -m http.server 8081
```
Then open http://localhost:8081.
If you want to render new traces, add or update entries in website/benchmark_config.json. Regenerate the static website data after adding or changing configured traces:
```
python website/generate_data.py
cd website
python -m http.server 8081
```

To contribute a new benchmark row:

- Select a standard workload from workload_suite/ or trace_collection/workload.md.
- Collect profiler artifacts outside the repository. Keep the final trace directory name stable.
- Fill in workload_card_template.yaml and store the card with the trace artifacts.
- Add the lightweight workload card under trace_collection/<workload_name>/ when it is useful for review and reproducibility.
- Add the trace and metric mapping to website/benchmark_config.json (see the configuration sketch after this list).
- Regenerate website/benchmark_data.json and website/data.js.
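For illustration only, a configured entry in website/benchmark_config.json might look like the sketch below. The "trace" key and the /data/ccl-bench_trace_collection prefix are the only parts confirmed elsewhere in this README; "name" and "metrics" are assumed field names, so copy the structure of an existing entry rather than this sketch.

```json
{
  "name": "llama3-torchtitan-nccl-4gpu-fsdp_2-tp_2-b_4-s_512",
  "trace": "/data/ccl-bench_trace_collection/llama3-torchtitan-nccl-4gpu-fsdp_2-tp_2-b_4-s_512",
  "metrics": ["avg_step_time"]
}
```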
Each row should make clear:
- model, phase, precision, dataset, batch size, and sequence lengths;
- hardware type, GPU/TPU count, and per-node count;
- framework and compiler/runtime versions;
- tensor/data/pipeline/expert parallelism;
- communication library and relevant environment variables;
- which trace artifacts were used for each metric.
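For illustration, a filled-in workload card covering these fields might look roughly like the sketch below. The key names are placeholders inferred from the list above, and the concrete values are taken from the sample trace name; workload_card_template.yaml remains the canonical schema.

```yaml
# Illustrative sketch only: key names and values are placeholders,
# not the canonical schema from workload_card_template.yaml.
model: llama3
phase: train
precision: <precision>            # e.g. bf16
dataset: <dataset>
batch_size: 4
sequence_length: 512
hardware:
  accelerator: <GPU/TPU type>
  device_count: 4
  devices_per_node: 4
framework: torchtitan             # record compiler/runtime versions too
parallelism:
  data: 2                         # FSDP degree
  tensor: 2
  pipeline: 1
  expert: 1
communication:
  library: nccl
  env: {}                         # relevant environment variables
artifacts:
  avg_step_time: <profiler trace directory used for this metric>
```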
Metrics are implemented in tools/ and invoked through tools/main.py. The public website uses the subset configured in website/benchmark_config.json; additional tools can remain in the repository for experiments as long as they are documented and do not require checked-in raw traces.
See tools/README.md for the supported metric interface and current dashboard metrics.
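As a rough sketch of what such a tool could look like (the real contract, function names, and trace file layout are defined in tools/README.md and tools/, not here), a metric generally maps a trace directory to a single leaderboard value:

```python
# Hypothetical metric sketch -- NOT the actual CCL-Bench tool interface.
# Assumes per-step timings were exported to a step_times.json file inside
# the trace directory; see tools/README.md for the supported interface.
from pathlib import Path
import json

def avg_step_time(trace_dir: str) -> float:
    """Average step time, assuming step_times.json holds a list of step durations."""
    steps_file = Path(trace_dir) / "step_times.json"  # assumed artifact name
    step_times = json.loads(steps_file.read_text())
    return sum(step_times) / len(step_times)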
Commit:
- source code and scripts required to reproduce a row;
- workload cards and small metadata files;
- generated website JSON/JS when updating the public leaderboard;
- documentation explaining non-obvious trace or environment requirements.
Do not commit:
- virtual environments or package caches;
- raw profiler dumps unless they are intentionally tiny test fixtures;
- local API keys, credentials, or machine-specific scratch paths;
- large intermediate logs that are not part of the artifact.
The canonical shared trace directory is /data/ccl-bench_trace_collection.
This path appears in three places and must be updated consistently if you move traces
to a different mount point or machine:
| Location | How to change |
|---|---|
| `website/benchmark_config.json` — every `"trace"` path | Update each path prefix to match your local mount point. The paths must resolve on whichever machine runs `python website/generate_data.py`. |
| `agent/ccl_bench_agent/tuning_config.yaml` — `publish_dir` | Set `publish_dir` to the desired destination. CCL-Search copies per-iteration traces there. Leave empty to skip publishing. |
If you are running on a different cluster, set publish_dir in tuning_config.yaml and
update the "trace": paths in benchmark_config.json before regenerating the website.
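If you need to rewrite many path prefixes at once, a small helper along the lines of the sketch below can do it. This is a hypothetical script, not part of the repository; it only assumes that benchmark_config.json is JSON whose trace locations are string values stored under "trace" keys, so inspect the file before running anything like it.

```python
# Hypothetical helper: rewrite every "trace" path prefix in benchmark_config.json.
import json

OLD_PREFIX = "/data/ccl-bench_trace_collection"
NEW_PREFIX = "/mnt/my-cluster/ccl-bench_traces"  # adjust to your mount point

def rewrite(node):
    # Recursively visit dicts/lists and rewrite string values stored under "trace".
    if isinstance(node, dict):
        return {
            key: (value.replace(OLD_PREFIX, NEW_PREFIX, 1)
                  if key == "trace" and isinstance(value, str)
                  else rewrite(value))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rewrite(item) for item in node]
    return node

with open("website/benchmark_config.json") as f:
    config = json.load(f)

with open("website/benchmark_config.json", "w") as f:
    json.dump(rewrite(config), f, indent=2)
```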
CCL-Search (agent/) and the simulation pipeline (simulation/) are first-class
contributions: CCL-Search automates configuration tuning and records every trial as a
benchmark entry; the simulation pipeline converts traces to Chakra execution graphs for
Astra-Sim what-if analysis. Both require the shared trace directory to be accessible.
