
fix(vllm): data parallel fixes and benchmarks#1097

Open
dacorvo wants to merge 7 commits into main from data_parallel_benchmarks

Conversation

@dacorvo
Collaborator

@dacorvo dacorvo commented Mar 18, 2026

What does this PR do?

Fixes several issues with the vLLM data parallel (DP) server-manager path and updates benchmarks.

Fixes

  • serve.py: accept --data_parallel_size (underscore form) to match other CLI args like --tensor_parallel_size
  • server_manager.py: restrict NEURON_RT_VISIBLE_CORES in the DP=1 path so a single worker doesn't claim all cores
  • server_manager.py: isolate DP worker workdirs (BASE_COMPILE_WORK_DIR, CWD) to prevent compile races on temp files
  • server_manager.py: pass --data-parallel-size and --data-parallel-rank to each vLLM subprocess so vLLM's set_device_control_env_var() correctly propagates NEURON_RT_VISIBLE_CORES to EngineCore processes
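The per-rank core visibility and flag propagation described above can be sketched roughly as follows. This is a minimal illustration, not the actual server_manager.py code; the helper names `visible_cores_for_rank` and `worker_cmd` are hypothetical:

```python
def visible_cores_for_rank(dp_rank: int, tp_size: int, all_cores: list[int]) -> str:
    """Give each DP rank a disjoint slice of tp_size physical NeuronCores."""
    start = dp_rank * tp_size
    chunk = all_cores[start:start + tp_size]
    return ",".join(str(c) for c in chunk)

def worker_cmd(base_cmd: list[str], dp_rank: int, dp_size: int) -> list[str]:
    """Append the DP flags so vLLM can derive per-rank core visibility itself."""
    return base_cmd + [
        "--data-parallel-size", str(dp_size),
        "--data-parallel-rank", str(dp_rank),
    ]
```

For example, with 32 cores and TP=8, rank 1 would see cores 8-15, leaving the other slices free for the remaining DP replicas.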

Benchmarks

  • Flatten vLLM benchmark directory structure and add unified serve.sh
  • Add Qwen3-30B-A3B DP scaling results on inf2.48xlarge

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dacorvo dacorvo force-pushed the data_parallel_benchmarks branch from 2ab80f3 to bd9a8a2 on March 19, 2026 17:59
@dacorvo dacorvo changed the title from "Data parallel benchmarks" to "fix(vllm): data parallel fixes and benchmarks" on Mar 20, 2026
@dacorvo dacorvo marked this pull request as ready for review March 20, 2026 07:47
@dacorvo dacorvo requested review from Copilot and tengomucho March 20, 2026 07:47
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes correctness and isolation issues in the vLLM data-parallel (DP) server-manager execution path (core visibility, per-rank working directories, and DP env propagation), and reshapes the benchmark artifacts into a flatter, unified layout with updated DP scaling results.

Changes:

  • Add CLI compatibility for --data_parallel_size and improve DP=1 core visibility handling in optimum-cli neuron serve.
  • Improve DP subprocess isolation and DP rank propagation in the vLLM server manager (separate workdirs/CWD and pass --data-parallel-size/--data-parallel-rank).
  • Flatten benchmark layout with a unified benchmark/vllm/serve.sh, updated docs, and new/updated result/config files while removing the older docker-compose based DP benchmark setup.

Reviewed changes

Copilot reviewed 41 out of 51 changed files in this pull request and generated 2 comments.

Per-file summary (file — description):
optimum/neuron/vllm/server_manager.py Isolate per-rank compile/CWD and pass DP size/rank flags to each vLLM subprocess.
optimum/commands/neuron/serve.py Accept underscore DP flag alias and restrict visible NeuronCores for DP=1 runs.
benchmark/vllm/single-instance/serve.sh Remove legacy single-instance benchmark serve script.
benchmark/vllm/single-instance/README.md Remove legacy single-instance benchmark README.
benchmark/vllm/single-instance/qwen3-30B-A3B/.env Remove legacy single-instance env config.
benchmark/vllm/single-instance/qwen3-30B-A3B-trn2/.env Remove legacy single-instance env config (trn2).
benchmark/vllm/single-instance/qwen3-235B-A22B-trn2/.env Remove legacy single-instance env config (trn2).
benchmark/vllm/serve.sh Add unified serve wrapper supporting optional DP.
benchmark/vllm/README.md Update benchmark instructions to the new unified layout and DP flow.
benchmark/vllm/qwen3-32B-trn2/vllm-results.csv Add benchmark results dataset.
benchmark/vllm/qwen3-32B-trn2/serve-dp1-tp4.env Normalize env format and add explicit DATA_PARALLEL_SIZE.
benchmark/vllm/qwen3-30B-A3B/vllm-results.csv Add benchmark results dataset.
benchmark/vllm/qwen3-30B-A3B/serve-dp2-tp16.env Add serve configuration for DP/TP combo.
benchmark/vllm/qwen3-30B-A3B/serve-dp1-tp32.env Add serve configuration for DP/TP combo.
benchmark/vllm/qwen3-30B-A3B-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/qwen3-30B-A3B-trn2/serve-dp1-tp4.env Add serve configuration (trn2).
benchmark/vllm/qwen3-30B-A3B-2507/vllm-results.csv Add DP scaling benchmark dataset (inf2).
benchmark/vllm/qwen3-30B-A3B-2507/serve-dp3-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/qwen3-30B-A3B-2507/serve-dp2-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/qwen3-30B-A3B-2507/serve-dp1-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/qwen3-30B-A3B-2507/README.md Add narrative DP scaling report for Qwen3-30B-A3B on inf2.48xlarge.
benchmark/vllm/qwen3-235B-A22B-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/qwen3-235B-A22B-trn2/serve-dp1-tp64.env Add serve configuration (trn2).
benchmark/vllm/llama4-Scout/vllm-results.csv Add benchmark results dataset.
benchmark/vllm/llama4-Scout/serve-dp1-tp32.env Add explicit DATA_PARALLEL_SIZE to serve config.
benchmark/vllm/llama4-Scout-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/llama4-Scout-trn2/serve-dp1-tp64.env Add explicit DATA_PARALLEL_SIZE to serve config (trn2).
benchmark/vllm/llama4-Maverick-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/llama4-Maverick-trn2/serve-dp1-tp64.env Add explicit DATA_PARALLEL_SIZE to serve config (trn2).
benchmark/vllm/llama3-70B-trn2/serve-dp2-tp32.env Add serve configuration for DP=2 / TP=32.
benchmark/vllm/llama-3.1-8b/vllm-results-dp4.csv Add DP=4 benchmark results dataset.
benchmark/vllm/llama-3.1-8b/vllm-results-dp3.csv Add DP=3 benchmark results dataset.
benchmark/vllm/llama-3.1-8b/vllm-results-cli-dp.csv Add CLI-based DP benchmark results dataset.
benchmark/vllm/llama-3.1-8b/serve-dp3-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/llama-3.1-8b/serve-dp2-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/llama-3.1-8b/serve-dp1-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/llama-3.1-8b/dp-benchmark-results.md Add summarized DP scaling report for Llama-3.1-8B.
benchmark/vllm/llama-3.1-8b-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/llama-3.1-8b-trn2/serve-dp1-tp4.env Add explicit DATA_PARALLEL_SIZE to serve config (trn2).
benchmark/vllm/data-parallel/README.md Remove legacy docker-compose based DP benchmark instructions.
benchmark/vllm/data-parallel/qwen3-30B-A3B/nginx.conf Remove legacy nginx LB config for docker-compose DP setup.
benchmark/vllm/data-parallel/qwen3-30B-A3B/docker-compose.yaml Remove legacy docker-compose DP deployment.
benchmark/vllm/data-parallel/qwen3-30B-A3B/.env Remove legacy docker-compose env file.
benchmark/vllm/data-parallel/llama3.1-8b/nginx-dp4.conf Remove legacy nginx LB config (dp4).
benchmark/vllm/data-parallel/llama3.1-8b/nginx-dp3.conf Remove legacy nginx LB config (dp3).
benchmark/vllm/data-parallel/llama3.1-8b/docker-compose-dp4.yaml Remove legacy docker-compose DP deployment (dp4).
benchmark/vllm/data-parallel/llama3.1-8b/docker-compose-dp3.yaml Remove legacy docker-compose DP deployment (dp3).
benchmark/vllm/data-parallel/llama3.1-8b/.env Remove legacy docker-compose env file.
benchmark/vllm/data-parallel/llama3-70B-trn2/nginx.conf Remove legacy nginx LB config for docker-compose DP setup.
benchmark/vllm/data-parallel/llama3-70B-trn2/docker-compose.yaml Remove legacy docker-compose DP deployment.
benchmark/vllm/data-parallel/llama3-70B-trn2/.env Remove legacy docker-compose env file.

dacorvo and others added 6 commits March 26, 2026 07:50
When `optimum-cli neuron serve` runs with DP=1, it previously did not
set NEURON_RT_VISIBLE_CORES, causing NxD to claim all available
NeuronCores on the instance instead of just the tensor_parallel_size
cores needed. This blocked subsequent DP replicas or other workloads
from using the remaining cores.

Reuse VLLMServerManager._resolve_physical_cores() to parse the existing
env var and restrict to the first TP cores.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
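The restriction described in this commit can be sketched like so. This is a hypothetical stand-in for `VLLMServerManager._resolve_physical_cores`; the real parsing logic may differ:

```python
def resolve_physical_cores(spec: str) -> list[int]:
    """Parse a NEURON_RT_VISIBLE_CORES spec such as '0-7' or '0,2,4-6'."""
    cores: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.extend(range(int(lo), int(hi) + 1))
        else:
            cores.append(int(part))
    return cores

def restrict_to_tp(spec: str, tp_size: int) -> str:
    """For a DP=1 run, keep only the first tp_size cores of the visible set."""
    return ",".join(str(c) for c in resolve_physical_cores(spec)[:tp_size])
```

With the full instance visible as "0-31" and tensor_parallel_size=8, the restricted spec becomes "0,1,2,3,4,5,6,7", leaving the remaining 24 cores for other workloads.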
…e.sh

Replace the split single-instance/ and data-parallel/ directory structure
(with docker-compose + nginx configs) with a flat layout where each model
has serve-dpX-tpY.env files sourced by a single serve.sh script that
calls `optimum-cli neuron serve`. Add Llama 3.1 8B data-parallel
benchmark results (DP=1/2/3 on inf2.48xlarge, TP=8).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When data_parallel_size > 1, the server manager spawns multiple vLLM
processes that each independently export/compile the model. These
processes were sharing /tmp/nxd_model/ as the compiler workdir and the
same CWD, causing races on HLO temp files and shutil.rmtree collisions.

Give each DP rank its own BASE_COMPILE_WORK_DIR and CWD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
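The per-rank isolation could look roughly like the sketch below. Directory names and the way `BASE_COMPILE_WORK_DIR` is consumed are assumptions about the real code, shown only to illustrate the idea:

```python
import os
import tempfile

def rank_workdir(dp_rank: int) -> str:
    """Create a private compile/working directory for one DP rank."""
    path = os.path.join(tempfile.gettempdir(), "nxd_model", f"dp{dp_rank}")
    os.makedirs(path, exist_ok=True)
    return path

def rank_env(dp_rank: int) -> dict:
    """Environment overrides so concurrent ranks never share temp files."""
    workdir = rank_workdir(dp_rank)  # also used as the subprocess CWD
    env = dict(os.environ)
    env["BASE_COMPILE_WORK_DIR"] = workdir
    return env
```

Because each rank compiles under its own subtree, one rank's cleanup (e.g. a shutil.rmtree of its workdir) can no longer delete HLO temp files that another rank is still writing.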
Qwen3-30B-A3B-Instruct-2507 benchmarked with TP=8, SL=4096 across
DP1/DP2/DP3 at BS=4 and BS=32. BS=32 is the sweet spot at 124 tok/s
(DP3), while BS=64 exceeds device memory. Full results in CSV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tell vLLM each worker's DP rank so its set_device_control_env_var()
correctly propagates NEURON_RT_VISIBLE_CORES to EngineCore processes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dacorvo dacorvo force-pushed the data_parallel_benchmarks branch from bd9a8a2 to 7add424 on March 26, 2026 07:50