- Haojia Sun: haojias@andrew.cmu.edu
- Yuling Wang: yulingwa@andrew.cmu.edu
This repository is a research prototype for long-context decode and continuous batching on GPUs.
The main question is:
When does page-level dynamic sparsity beat dense decode once page scoring, top-k selection, KV gather, and batching overheads are included?
The repo currently contains four main decode paths:
- `dense_cpu`: dense attention on CPU
- `sparse_cpu`: CPU reference sparse pipeline with `score -> topk -> gather -> attend`
- `dense_gpu`: dense CUDA baseline
- `sparse_gpu`: sparse CUDA path on H100/sm90, with GPU page scoring, GPU top-k selection, GPU gather, and GPU sparse attention; sparse layout preparation still includes host-side metadata work
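For orientation, here is a minimal single-head, single-request C++ sketch of the `score -> topk -> gather -> attend` shape. Names and layouts are illustrative only; the repo's actual paths are batched and paged.

```cpp
// Toy sparse decode pipeline: score pages, pick top-k, gather, attend.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int head_dim = 4, page_size = 2, num_pages = 4, top_k = 2;
    std::vector<float> q(head_dim, 1.0f);
    std::vector<float> K(num_pages * page_size * head_dim);   // page-major K
    std::vector<float> V(num_pages * page_size * head_dim);   // page-major V
    for (size_t i = 0; i < K.size(); ++i) { K[i] = 0.01f * float(i); V[i] = 0.02f * float(i); }

    // 1) score: one score per page; here the page summary is the mean key.
    std::vector<float> score(num_pages, 0.0f);
    for (int p = 0; p < num_pages; ++p)
        for (int s = 0; s < page_size; ++s)
            for (int d = 0; d < head_dim; ++d)
                score[p] += q[d] * K[(p * page_size + s) * head_dim + d] / page_size;

    // 2) topk: ids of the top_k highest-scoring pages.
    std::vector<int> order(num_pages);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + top_k, order.end(),
                      [&](int a, int b) { return score[a] > score[b]; });

    // 3) gather: pack the selected pages' K/V into dense buffers.
    std::vector<float> Kg, Vg;
    for (int i = 0; i < top_k; ++i)
        for (int j = 0; j < page_size * head_dim; ++j) {
            Kg.push_back(K[order[i] * page_size * head_dim + j]);
            Vg.push_back(V[order[i] * page_size * head_dim + j]);
        }

    // 4) attend: softmax(q.K_sel / sqrt(d)) @ V_sel over gathered tokens only.
    const int n = top_k * page_size;
    std::vector<float> logits(n, 0.0f), out(head_dim, 0.0f);
    for (int t = 0; t < n; ++t)
        for (int d = 0; d < head_dim; ++d)
            logits[t] += q[d] * Kg[t * head_dim + d] / std::sqrt(float(head_dim));
    float m = *std::max_element(logits.begin(), logits.end()), z = 0.0f;
    for (float& l : logits) { l = std::exp(l - m); z += l; }
    for (int t = 0; t < n; ++t)
        for (int d = 0; d < head_dim; ++d)
            out[d] += logits[t] / z * Vg[t * head_dim + d];

    for (int d = 0; d < head_dim; ++d) std::printf("%.4f ", out[d]);
    std::printf("\n");
}
```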
It also contains continuous-serving benchmark paths:
- `continuous_sparse`: continuous batching over the sparse GPU decode path
- `continuous_dense_gpu`: continuous batching over the dense CUDA decode path
- `serial_sparse`: one-active-request sparse baseline used for continuous-vs-serial comparison
The sparse decode pipeline is built on top of explicit page-pool abstractions for both host and device memory.
- `PagePool`: contiguous host-side K/V storage with page allocation, free-list reuse, and slot-based addressing
- `PagedKvCache`: logical request-to-page mapping on top of `PagePool`, including page descriptors, page summaries, and request page-order tracking
- `DevicePagePool`: CUDA-resident mirror for page payloads and metadata used by the sparse GPU path
These modules make it possible to separate logical page selection from the physical storage layout used by CPU and GPU decode paths.
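As an illustration of how these pieces fit together, here is a toy host-side pool with free-list reuse and a logical request-to-page mapping on top of it. All names and signatures are illustrative, not the repo's actual API.

```cpp
// Toy page pool + paged KV cache, in the spirit of PagePool/PagedKvCache.
#include <cassert>
#include <unordered_map>
#include <vector>

struct ToyPagePool {
    int page_size, head_dim;
    std::vector<float> kv;        // contiguous K/V payload, slot-addressed
    std::vector<int> free_list;   // recycled page ids
    int next_page = 0;

    ToyPagePool(int pages, int page_size, int head_dim)
        : page_size(page_size), head_dim(head_dim),
          kv(size_t(pages) * page_size * head_dim * 2) {}  // *2: K and V

    int allocate() {
        if (!free_list.empty()) { int p = free_list.back(); free_list.pop_back(); return p; }
        return next_page++;
    }
    void release(int page) { free_list.push_back(page); }
    float* page_data(int page) { return kv.data() + size_t(page) * page_size * head_dim * 2; }
};

// Logical request -> ordered page list, on top of the physical pool.
struct ToyPagedKvCache {
    ToyPagePool& pool;
    std::unordered_map<int, std::vector<int>> request_pages;  // request id -> page order

    explicit ToyPagedKvCache(ToyPagePool& p) : pool(p) {}

    void append_token(int request_id, int token_index) {
        auto& pages = request_pages[request_id];
        if (token_index % pool.page_size == 0) pages.push_back(pool.allocate());
        // The token's K/V would be written into pool.page_data(pages.back()) here.
    }
    void finish(int request_id) {
        for (int p : request_pages[request_id]) pool.release(p);
        request_pages.erase(request_id);
    }
};

int main() {
    ToyPagePool pool(/*pages=*/8, /*page_size=*/16, /*head_dim=*/128);
    ToyPagedKvCache cache(pool);
    for (int t = 0; t < 40; ++t) cache.append_token(/*request_id=*/0, t);  // spans 3 pages
    cache.finish(0);                        // pages return to the free list for reuse
    assert(pool.free_list.size() == 3);
}
```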
For continuous batching, request prompts are admitted into the paged cache, active requests are packed into per-step batches, generated tokens are appended back to the cache, and finished requests release their pages for reuse. The default prompt admission mode is direct GPU upload.
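A skeleton of that serving loop, with hypothetical types and policies (the repo's runner differs in detail):

```cpp
// Shape of the continuous-batching loop: admit -> pack -> step -> release.
// Requires C++20 for std::erase_if on vectors.
#include <deque>
#include <vector>

struct Request { int id; int remaining_decode_steps; };

int main() {
    std::deque<Request> pending = {{0, 3}, {1, 2}, {2, 4}};
    std::vector<Request> active;
    const size_t max_active = 2;

    while (!pending.empty() || !active.empty()) {
        // Admit: move arrived prompts into the paged cache / active set.
        while (active.size() < max_active && !pending.empty()) {
            active.push_back(pending.front());   // prompt upload would happen here
            pending.pop_front();
        }
        // Pack + decode one step for every active request (one batched kernel),
        // appending each generated token back to the cache.
        for (Request& r : active) --r.remaining_decode_steps;
        // Release: finished requests return their pages for reuse.
        std::erase_if(active, [](const Request& r) { return r.remaining_decode_steps == 0; });
    }
}
```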
Basic build:
```
cmake -S . -B build
cmake --build build -j
ctest --test-dir build --output-on-failure
```

For CUDA builds on H100, make sure `nvcc` is visible first. One working example is:
```
export PATH=/opt/packages/cuda/v12.4.0/bin:$PATH
export LIBRARY_PATH=/opt/packages/cuda/v12.4.0/lib64:$LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/packages/cuda/v12.4.0/lib64:$LD_LIBRARY_PATH
rm -rf build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DDSD_ENABLE_CUDA=ON -DDSD_CUDA_ARCHITECTURES=90
```

Run the demo:

```
./build/dsd_demo
```

Run the sparse CUDA correctness test:

```
./build/dsd_sparse_cuda_test
```

Run the dense CUDA correctness test:

```
./build/dsd_dense_cuda_test
```

Run the benchmark:

```
./build/dsd_bench [top_k_pages] [batch_size] [min_context_tokens] [max_context_tokens] [seed] [iterations] [warmup] [num_heads] [head_dim] [page_size]
```

Example:

```
./build/dsd_bench 8 16 512 2048 7 10 2 32 128 16
```

Single-request example:

```
./build/dsd_bench 8 1 1024 1024 7 10 2
```

Run the continuous batching benchmark:
```
./build/dsd_continuous_bench \
    [num_requests] [max_active_requests] [arrival_window] \
    [min_prompt_tokens] [max_prompt_tokens] \
    [min_decode_steps] [max_decode_steps] [seed] \
    [num_heads] [head_dim] [page_size] [top_k_pages] \
    [admission_mode] [preadmit_prompts] [precompute_decode_payloads] \
    [gpu_synthetic_decode_append] [lazy_release] \
    [run_batch_output_mode] [run_batch_timing_mode] [benchmark_mode]
```

Important continuous benchmark modes:
- `admission_mode`: 0 = cpu_cache, 1 = direct_gpu_upload, 2 = synthetic_gpu_prefill
- `run_batch_output_mode`: 0 = full_outputs, 1 = outputs_only, 2 = no_outputs
- `run_batch_timing_mode`: 0 = host_wall, 1 = kernel_events, 2 = disabled
- `benchmark_mode`: 0 = sparse, 1 = dense_gpu
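One plausible way to read these positional mode arguments, shown as a hypothetical enum mapping (the repo's actual parsing code and enum names may differ):

```cpp
// Hypothetical mapping from positional mode arguments to enums.
#include <cstdlib>

enum class AdmissionMode { CpuCache = 0, DirectGpuUpload = 1, SyntheticGpuPrefill = 2 };
enum class RunBatchOutputMode { FullOutputs = 0, OutputsOnly = 1, NoOutputs = 2 };
enum class RunBatchTimingMode { HostWall = 0, KernelEvents = 1, Disabled = 2 };
enum class BenchmarkMode { Sparse = 0, DenseGpu = 1 };

int main(int argc, char** argv) {
    // In the full argument list, admission_mode is the 13th positional argument.
    AdmissionMode mode = argc > 1 ? static_cast<AdmissionMode>(std::atoi(argv[1]))
                                  : AdmissionMode::DirectGpuUpload;  // README default
    return mode == AdmissionMode::DirectGpuUpload ? 0 : 1;
}
```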
Example sparse continuous run:
```
./build/dsd_continuous_bench 128 32 64 1024 4096 16 64 7 32 128 16 8
```

Example dense GPU continuous run:

```
./build/dsd_continuous_bench 128 32 64 1024 4096 16 64 7 32 128 16 8 1 0 0 0 0 1 1 1
```

`dsd_bench` now prints three sections before the accuracy summary.
Analytical Ceiling
- exact batch totals for context tokens, candidate pages, and selected tokens
- dense, current sparse, and target sparse byte/FLOP estimates
- H100 bandwidth and FP32 lower-bound times
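For intuition, the lower-bound arithmetic can be reproduced by hand: decode is typically bound by KV reads, so a floor on step time is bytes read divided by peak bandwidth. A sketch with illustrative sizes follows; the 3.35 TB/s figure assumes an H100 SXM5, and all workload numbers are made up.

```cpp
// Back-of-envelope bandwidth lower bound: time >= bytes_read / peak_bandwidth.
#include <cstdio>

int main() {
    const double context_tokens  = 16 * 2048.0;       // batch * avg context length
    const double selected_tokens = 16 * 8 * 16.0;     // batch * top_k_pages * page_size
    const double num_heads = 32, head_dim = 128, bytes_per_elem = 4;  // FP32 K and V
    const double kv_bytes_per_token = 2 * num_heads * head_dim * bytes_per_elem;

    const double hbm_bw = 3.35e12;  // bytes/s, H100 SXM5 peak HBM bandwidth
    const double dense_s  = context_tokens  * kv_bytes_per_token / hbm_bw;
    const double sparse_s = selected_tokens * kv_bytes_per_token / hbm_bw;
    std::printf("dense lower bound:  %.3f ms\n", dense_s * 1e3);
    std::printf("sparse lower bound: %.3f ms (%.1fx less KV traffic)\n",
                sparse_s * 1e3, dense_s / sparse_s);
}
```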
End-to-End
- Total wall-clock time for each path
- Speedup relative to `dense_cpu`
Stage / Kernel Breakdown
- `dense_cpu`: CPU dense attention time
- `sparse_cpu`: CPU score, top-k, gather, attention breakdown
- `dense_gpu`: CUDA dense attention kernel time
- `sparse_gpu`: batched GPU score/top-k plus fused sparse attention breakdown
It also prints an Accuracy block with max-absolute-difference comparisons:
- `sparse_cpu` vs `dense_cpu`
- `sparse_gpu` vs `sparse_cpu`
- `sparse_gpu` vs `dense_cpu`
- `dense_gpu` vs `dense_cpu`
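A max-absolute-difference check is just an elementwise reduction; a minimal standalone version (illustrative, not the repo's helper):

```cpp
// Max absolute difference between two output buffers.
#include <algorithm>
#include <cmath>
#include <vector>

float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float m = 0.0f;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i)
        m = std::max(m, std::fabs(a[i] - b[i]));
    return m;
}
```

Note that the sparse-vs-dense comparisons presumably measure the approximation error introduced by page selection, while the same-algorithm GPU-vs-CPU comparisons isolate floating-point drift.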
`dsd_continuous_bench` prints key-value output for continuous serving runs.
Sparse mode reports both `continuous_sparse_*` and `serial_sparse_*` statistics
plus `continuous_vs_serial_speedup`; dense GPU mode reports
`continuous_dense_gpu_*` statistics.
Important continuous fields include:
- `*_tokens_per_second`, `*_avg_active_batch_size`, `*_avg_step_ms`, `*_p95_step_ms`
- `*_measured_end_to_end_ms`, `*_total_ms`, `*_run_batch_wall_ms`, `*_outside_run_batch_ms`
- `*_admission_ms`, `*_prompt_preadmit_ms`, `*_runtime_admission_ms`
- `*_append_sync_ms`, `*_release_sync_ms`, `*_serving_loop_other_ms`
- `*_avg_score_ms`, `*_avg_topk_ms`, `*_avg_attention_ms`, `*_avg_kernel_ms`
- `*_avg_h2d_ms`, `*_avg_d2h_ms`, `*_avg_launch_ms`, `*_avg_sync_ms`
- `*_device_h2d_bytes`, `*_device_d2h_bytes`, and transfer call counters
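As a rough consistency check between fields, the headline throughput should track total generated tokens over the measured end-to-end time. A toy computation with hypothetical values (the repo's exact accounting may differ):

```cpp
// tokens_per_second ~= generated tokens / end-to-end wall time.
#include <cstdio>

int main() {
    const double total_generated_tokens = 128 * 40.0;   // requests * avg decode steps
    const double measured_end_to_end_ms = 900.0;        // hypothetical run time
    std::printf("tokens_per_second ~= %.0f\n",
                total_generated_tokens / (measured_end_to_end_ms / 1e3));
}
```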
The experiment/ directory contains report-oriented sweep scripts and matching
plot scripts. Each shell script writes raw stdout and CSV files under
results/report_experiments/; each plotting script reads the CSV and writes
figures under that experiment's figures/ directory.
```
./experiment/00_backend_total_time.sh
./experiment/01_decode_context_length_sweep.sh
./experiment/02_decode_batch_scaling_sweep.sh
./experiment/03_decode_topk_sparsity_sweep.sh
./experiment/04_decode_page_size_sweep.sh
./experiment/05_continuous_batching_load_sweep.sh
./experiment/06_continuous_runtime_options_ablation.sh
./experiment/07_attention_backend_breakdown.sh
```

Plotting example:
```
./experiment/plot_05_continuous_batching_load_sweep.py
```

See experiment/README.md for the report question and key CSV columns for each sweep.
```
include/dsd/
  config.h                          model/runtime configuration
  cuda_utils.h                      CUDA helpers and checked device buffers
  types.h                           common pipeline data structures
  page_pool.h                       host-side page allocator for contiguous KV storage
  paged_kv_cache.h                  paged KV-cache abstraction
  device_page_pool.h                CUDA-resident page pool and page metadata storage
  reference_kernels.h               CPU reference scoring/top-k/gather/attention
  cuda_dense_attention.h            dense CUDA interface
  cuda_sparse_attention.h           sparse CUDA interface
  decode_pipeline.h                 end-to-end sparse and dense decode driver
  continuous_batching.h             synthetic continuous-serving workload and runners
  profiler.h                        lightweight stage timer
  synthetic_data.h                  ragged synthetic batch generator
src/
  main.cpp
  page_pool.cpp
  paged_kv_cache.cpp
  reference_kernels.cpp
  decode_pipeline.cpp
  continuous_batching.cpp
  synthetic_data.cpp
  device_page_pool.cpp              CUDA-backed page-pool implementation
  cuda_dense_attention_stub.cpp
  cuda_sparse_attention_stub.cpp
  cuda/kernels.cu                   dense CUDA baseline
  cuda/sparse_batched_attention.cu  batched sparse CUDA score/top-k/fused-attention
benchmarks/
  bench_decode.cpp                  benchmark harness for dense/sparse CPU/GPU comparison
  bench_continuous_decode.cpp       continuous batching benchmark harness
experiment/
  README.md                         report experiment guide
  *.sh                              reproducible experiment sweeps
  plot_*.py                         CSV-to-figure plotting scripts
tests/
  test_page_pool.cpp
  test_reference.cpp
  test_device_page_pool.cpp
  test_dense_cuda.cpp
  test_sparse_cuda.cpp
  test_continuous_batching.cpp
```
- `dense_gpu` and `sparse_gpu` both use persistent HBM contexts for static KV/page data.
- `sparse_gpu` is now batch-oriented; gather is fused into attention in the fast path, so `gather_ms` is expected to be zero (see the sketch after this list).
- `sparse_gpu` total time still includes host/device copy, kernel launch, synchronization, and sparse layout preparation overhead.
- Continuous sparse and dense GPU benchmarks share the same synthetic workload generator and request admission/release model.
- Continuous sparse reports both continuous batching and one-active-request serial sparse results; dense GPU mode reports only the dense continuous path.
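To see why a fused fast path reports `gather_ms` of zero, note that the attention loop can index the page pool directly through the selected page ids instead of first copying K/V into a dense buffer. A CPU-side sketch of this idea (illustrative only, not the repo's CUDA kernel):

```cpp
// Fused sparse attention: no separate gather pass, pages are read in place.
#include <cmath>
#include <vector>

// Attend over selected pages directly; K and V are page-major [page][slot][dim].
void fused_sparse_attend(const float* q, const float* K, const float* V,
                         const std::vector<int>& selected_pages,
                         int page_size, int head_dim, float* out) {
    std::vector<float> w;  // unnormalized softmax weights per selected token
    const float scale = 1.0f / std::sqrt(float(head_dim));
    float z = 0.0f;
    for (int p : selected_pages)
        for (int s = 0; s < page_size; ++s) {
            const float* k = K + (size_t(p) * page_size + s) * head_dim;
            float logit = 0.0f;
            for (int d = 0; d < head_dim; ++d) logit += q[d] * k[d];
            w.push_back(std::exp(logit * scale));  // toy code: no max-subtraction
            z += w.back();
        }
    for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;
    size_t t = 0;
    for (int p : selected_pages)
        for (int s = 0; s < page_size; ++s, ++t) {
            const float* v = V + (size_t(p) * page_size + s) * head_dim;
            for (int d = 0; d < head_dim; ++d) out[d] += (w[t] / z) * v[d];
        }
}

int main() {
    const int page_size = 2, head_dim = 4, num_pages = 3;
    std::vector<float> q(head_dim, 1.0f), K(num_pages * page_size * head_dim, 0.1f),
                       V(num_pages * page_size * head_dim, 0.5f), out(head_dim);
    fused_sparse_attend(q.data(), K.data(), V.data(), {2, 0}, page_size, head_dim, out.data());
    // out[d] == 0.5 for all d here, since every V entry is 0.5.
}
```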