
fix(vllm): data parallel fixes and benchmarks#1097

Open
dacorvo wants to merge 7 commits into main from data_parallel_benchmarks

Conversation

@dacorvo
Collaborator

@dacorvo dacorvo commented Mar 18, 2026

What does this PR do?

Fixes several issues with the vLLM data parallel (DP) server-manager path and updates benchmarks.

Fixes

  • serve.py: accept --data_parallel_size (underscore form) to match other CLI args like --tensor_parallel_size
  • server_manager.py: restrict NEURON_RT_VISIBLE_CORES in the DP=1 path so a single worker doesn't claim all cores
  • server_manager.py: isolate DP worker workdirs (BASE_COMPILE_WORK_DIR, CWD) to prevent compile races on temp files
  • server_manager.py: pass --data-parallel-size and --data-parallel-rank to each vLLM subprocess so vLLM's set_device_control_env_var() correctly propagates NEURON_RT_VISIBLE_CORES to EngineCore processes
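The per-rank core visibility and flag propagation described above can be sketched roughly as follows. This is a minimal illustration, not the actual server_manager.py code; the helper names `visible_cores_for_rank` and `worker_cmd` are hypothetical:

```python
def visible_cores_for_rank(dp_rank: int, tp_size: int, all_cores: list[int]) -> str:
    """Give each DP rank a disjoint slice of tp_size physical NeuronCores."""
    start = dp_rank * tp_size
    chunk = all_cores[start:start + tp_size]
    return ",".join(str(c) for c in chunk)

def worker_cmd(base_cmd: list[str], dp_rank: int, dp_size: int) -> list[str]:
    """Append the DP flags so vLLM can derive per-rank core visibility itself."""
    return base_cmd + [
        "--data-parallel-size", str(dp_size),
        "--data-parallel-rank", str(dp_rank),
    ]
```

For example, with 32 cores and TP=8, rank 1 would see cores 8-15, leaving the other slices free for the remaining DP replicas.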

Benchmarks

  • Flatten vLLM benchmark directory structure and add unified serve.sh
  • Add Qwen3-30B-A3B DP scaling results on inf2.48xlarge

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dacorvo dacorvo force-pushed the data_parallel_benchmarks branch from 2ab80f3 to bd9a8a2 on March 19, 2026 17:59
@dacorvo dacorvo changed the title from "Data parallel benchmarks" to "fix(vllm): data parallel fixes and benchmarks" on Mar 20, 2026
@dacorvo dacorvo marked this pull request as ready for review March 20, 2026 07:47
@dacorvo dacorvo requested review from Copilot and tengomucho March 20, 2026 07:47
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes correctness and isolation issues in the vLLM data-parallel (DP) server-manager execution path (core visibility, per-rank working directories, and DP env propagation), and reshapes the benchmark artifacts into a flatter, unified layout with updated DP scaling results.

Changes:

  • Add CLI compatibility for --data_parallel_size and improve DP=1 core visibility handling in optimum-cli neuron serve.
  • Improve DP subprocess isolation and DP rank propagation in the vLLM server manager (separate workdirs/CWD and pass --data-parallel-size/--data-parallel-rank).
  • Flatten benchmark layout with a unified benchmark/vllm/serve.sh, updated docs, and new/updated result/config files while removing the older docker-compose based DP benchmark setup.

Reviewed changes

Copilot reviewed 41 out of 51 changed files in this pull request and generated 2 comments.

Per-file summary (file — description):
optimum/neuron/vllm/server_manager.py Isolate per-rank compile/CWD and pass DP size/rank flags to each vLLM subprocess.
optimum/commands/neuron/serve.py Accept underscore DP flag alias and restrict visible NeuronCores for DP=1 runs.
benchmark/vllm/single-instance/serve.sh Remove legacy single-instance benchmark serve script.
benchmark/vllm/single-instance/README.md Remove legacy single-instance benchmark README.
benchmark/vllm/single-instance/qwen3-30B-A3B/.env Remove legacy single-instance env config.
benchmark/vllm/single-instance/qwen3-30B-A3B-trn2/.env Remove legacy single-instance env config (trn2).
benchmark/vllm/single-instance/qwen3-235B-A22B-trn2/.env Remove legacy single-instance env config (trn2).
benchmark/vllm/serve.sh Add unified serve wrapper supporting optional DP.
benchmark/vllm/README.md Update benchmark instructions to the new unified layout and DP flow.
benchmark/vllm/qwen3-32B-trn2/vllm-results.csv Add benchmark results dataset.
benchmark/vllm/qwen3-32B-trn2/serve-dp1-tp4.env Normalize env format and add explicit DATA_PARALLEL_SIZE.
benchmark/vllm/qwen3-30B-A3B/vllm-results.csv Add benchmark results dataset.
benchmark/vllm/qwen3-30B-A3B/serve-dp2-tp16.env Add serve configuration for DP/TP combo.
benchmark/vllm/qwen3-30B-A3B/serve-dp1-tp32.env Add serve configuration for DP/TP combo.
benchmark/vllm/qwen3-30B-A3B-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/qwen3-30B-A3B-trn2/serve-dp1-tp4.env Add serve configuration (trn2).
benchmark/vllm/qwen3-30B-A3B-2507/vllm-results.csv Add DP scaling benchmark dataset (inf2).
benchmark/vllm/qwen3-30B-A3B-2507/serve-dp3-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/qwen3-30B-A3B-2507/serve-dp2-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/qwen3-30B-A3B-2507/serve-dp1-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/qwen3-30B-A3B-2507/README.md Add narrative DP scaling report for Qwen3-30B-A3B on inf2.48xlarge.
benchmark/vllm/qwen3-235B-A22B-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/qwen3-235B-A22B-trn2/serve-dp1-tp64.env Add serve configuration (trn2).
benchmark/vllm/llama4-Scout/vllm-results.csv Add benchmark results dataset.
benchmark/vllm/llama4-Scout/serve-dp1-tp32.env Add explicit DATA_PARALLEL_SIZE to serve config.
benchmark/vllm/llama4-Scout-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/llama4-Scout-trn2/serve-dp1-tp64.env Add explicit DATA_PARALLEL_SIZE to serve config (trn2).
benchmark/vllm/llama4-Maverick-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/llama4-Maverick-trn2/serve-dp1-tp64.env Add explicit DATA_PARALLEL_SIZE to serve config (trn2).
benchmark/vllm/llama3-70B-trn2/serve-dp2-tp32.env Add serve configuration for DP=2 / TP=32.
benchmark/vllm/llama-3.1-8b/vllm-results-dp4.csv Add DP=4 benchmark results dataset.
benchmark/vllm/llama-3.1-8b/vllm-results-dp3.csv Add DP=3 benchmark results dataset.
benchmark/vllm/llama-3.1-8b/vllm-results-cli-dp.csv Add CLI-based DP benchmark results dataset.
benchmark/vllm/llama-3.1-8b/serve-dp3-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/llama-3.1-8b/serve-dp2-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/llama-3.1-8b/serve-dp1-tp8.env Add serve configuration for DP scaling.
benchmark/vllm/llama-3.1-8b/dp-benchmark-results.md Add summarized DP scaling report for Llama-3.1-8B.
benchmark/vllm/llama-3.1-8b-trn2/vllm-results.csv Add benchmark results dataset (trn2).
benchmark/vllm/llama-3.1-8b-trn2/serve-dp1-tp4.env Add explicit DATA_PARALLEL_SIZE to serve config (trn2).
benchmark/vllm/data-parallel/README.md Remove legacy docker-compose based DP benchmark instructions.
benchmark/vllm/data-parallel/qwen3-30B-A3B/nginx.conf Remove legacy nginx LB config for docker-compose DP setup.
benchmark/vllm/data-parallel/qwen3-30B-A3B/docker-compose.yaml Remove legacy docker-compose DP deployment.
benchmark/vllm/data-parallel/qwen3-30B-A3B/.env Remove legacy docker-compose env file.
benchmark/vllm/data-parallel/llama3.1-8b/nginx-dp4.conf Remove legacy nginx LB config (dp4).
benchmark/vllm/data-parallel/llama3.1-8b/nginx-dp3.conf Remove legacy nginx LB config (dp3).
benchmark/vllm/data-parallel/llama3.1-8b/docker-compose-dp4.yaml Remove legacy docker-compose DP deployment (dp4).
benchmark/vllm/data-parallel/llama3.1-8b/docker-compose-dp3.yaml Remove legacy docker-compose DP deployment (dp3).
benchmark/vllm/data-parallel/llama3.1-8b/.env Remove legacy docker-compose env file.
benchmark/vllm/data-parallel/llama3-70B-trn2/nginx.conf Remove legacy nginx LB config for docker-compose DP setup.
benchmark/vllm/data-parallel/llama3-70B-trn2/docker-compose.yaml Remove legacy docker-compose DP deployment.
benchmark/vllm/data-parallel/llama3-70B-trn2/.env Remove legacy docker-compose env file.

dacorvo and others added 6 commits March 26, 2026 07:50
When `optimum-cli neuron serve` runs with DP=1, it previously did not
set NEURON_RT_VISIBLE_CORES, causing NxD to claim all available
NeuronCores on the instance instead of just the tensor_parallel_size
cores needed. This blocked subsequent DP replicas or other workloads
from using the remaining cores.

Reuse VLLMServerManager._resolve_physical_cores() to parse the existing
env var and restrict to the first TP cores.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
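The restriction described in this commit can be sketched like so. This is a hypothetical stand-in for `VLLMServerManager._resolve_physical_cores`; the real parsing logic may differ:

```python
def resolve_physical_cores(spec: str) -> list[int]:
    """Parse a NEURON_RT_VISIBLE_CORES spec such as '0-7' or '0,2,4-6'."""
    cores: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.extend(range(int(lo), int(hi) + 1))
        else:
            cores.append(int(part))
    return cores

def restrict_to_tp(spec: str, tp_size: int) -> str:
    """For a DP=1 run, keep only the first tp_size cores of the visible set."""
    return ",".join(str(c) for c in resolve_physical_cores(spec)[:tp_size])
```

With the full instance visible as "0-31" and tensor_parallel_size=8, the restricted spec becomes "0,1,2,3,4,5,6,7", leaving the remaining 24 cores for other workloads.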
…e.sh

Replace the split single-instance/ and data-parallel/ directory structure
(with docker-compose + nginx configs) with a flat layout where each model
has serve-dpX-tpY.env files sourced by a single serve.sh script that
calls `optimum-cli neuron serve`. Add Llama 3.1 8B data-parallel
benchmark results (DP=1/2/3 on inf2.48xlarge, TP=8).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When data_parallel_size > 1, the server manager spawns multiple vLLM
processes that each independently export/compile the model. These
processes were sharing /tmp/nxd_model/ as the compiler workdir and the
same CWD, causing races on HLO temp files and shutil.rmtree collisions.

Give each DP rank its own BASE_COMPILE_WORK_DIR and CWD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
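The per-rank isolation could look roughly like the sketch below. Directory names and the way `BASE_COMPILE_WORK_DIR` is consumed are assumptions about the real code, shown only to illustrate the idea:

```python
import os
import tempfile

def rank_workdir(dp_rank: int) -> str:
    """Create a private compile/working directory for one DP rank."""
    path = os.path.join(tempfile.gettempdir(), "nxd_model", f"dp{dp_rank}")
    os.makedirs(path, exist_ok=True)
    return path

def rank_env(dp_rank: int) -> dict:
    """Environment overrides so concurrent ranks never share temp files."""
    workdir = rank_workdir(dp_rank)  # also used as the subprocess CWD
    env = dict(os.environ)
    env["BASE_COMPILE_WORK_DIR"] = workdir
    return env
```

Because each rank compiles under its own subtree, one rank's cleanup (e.g. a shutil.rmtree of its workdir) can no longer delete HLO temp files that another rank is still writing.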
Qwen3-30B-A3B-Instruct-2507 benchmarked with TP=8, SL=4096 across
DP1/DP2/DP3 at BS=4 and BS=32. BS=32 is the sweet spot at 124 tok/s
(DP3), while BS=64 exceeds device memory. Full results in CSV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tell vLLM each worker's DP rank so its set_device_control_env_var()
correctly propagates NEURON_RT_VISIBLE_CORES to EngineCore processes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dacorvo dacorvo force-pushed the data_parallel_benchmarks branch from bd9a8a2 to 7add424 on March 26, 2026 07:50