Open
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
2ab80f3 to
bd9a8a2
Compare
Pull request overview
This PR fixes correctness and isolation issues in the vLLM data-parallel (DP) server-manager execution path (core visibility, per-rank working directories, and DP env propagation), and refreshes/reshapes the benchmark artifacts to a flatter, unified layout with updated DP scaling results.
Changes:
- Add CLI compatibility for `--data_parallel_size` and improve DP=1 core visibility handling in `optimum-cli neuron serve`.
- Improve DP subprocess isolation and DP rank propagation in the vLLM server manager (separate workdirs/CWD and pass `--data-parallel-size`/`--data-parallel-rank`).
- Flatten benchmark layout with a unified `benchmark/vllm/serve.sh`, updated docs, and new/updated result/config files while removing the older docker-compose based DP benchmark setup.
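As an illustration of the first change, here is a minimal sketch of accepting both spellings of the DP flag, assuming an `argparse`-based parser. The `build_parser` helper and the `serve-sketch` program name are hypothetical and are not taken from `serve.py`.

```python
import argparse


def build_parser():
    """Hypothetical parser: both DP flag spellings map to one destination."""
    parser = argparse.ArgumentParser(prog="serve-sketch")
    parser.add_argument("--tensor_parallel_size", type=int, default=1)
    # Register the hyphen and underscore forms as aliases of the same option.
    parser.add_argument(
        "--data-parallel-size",
        "--data_parallel_size",
        dest="data_parallel_size",
        type=int,
        default=1,
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--data_parallel_size", "2"])
    print(args.data_parallel_size)
```

Registering both option strings on a single `add_argument` call keeps one default and one destination, so downstream code does not need to know which spelling the user typed.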
Reviewed changes
Copilot reviewed 41 out of 51 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| optimum/neuron/vllm/server_manager.py | Isolate per-rank compile/CWD and pass DP size/rank flags to each vLLM subprocess. |
| optimum/commands/neuron/serve.py | Accept underscore DP flag alias and restrict visible NeuronCores for DP=1 runs. |
| benchmark/vllm/single-instance/serve.sh | Remove legacy single-instance benchmark serve script. |
| benchmark/vllm/single-instance/README.md | Remove legacy single-instance benchmark README. |
| benchmark/vllm/single-instance/qwen3-30B-A3B/.env | Remove legacy single-instance env config. |
| benchmark/vllm/single-instance/qwen3-30B-A3B-trn2/.env | Remove legacy single-instance env config (trn2). |
| benchmark/vllm/single-instance/qwen3-235B-A22B-trn2/.env | Remove legacy single-instance env config (trn2). |
| benchmark/vllm/serve.sh | Add unified serve wrapper supporting optional DP. |
| benchmark/vllm/README.md | Update benchmark instructions to the new unified layout and DP flow. |
| benchmark/vllm/qwen3-32B-trn2/vllm-results.csv | Add benchmark results dataset. |
| benchmark/vllm/qwen3-32B-trn2/serve-dp1-tp4.env | Normalize env format and add explicit DATA_PARALLEL_SIZE. |
| benchmark/vllm/qwen3-30B-A3B/vllm-results.csv | Add benchmark results dataset. |
| benchmark/vllm/qwen3-30B-A3B/serve-dp2-tp16.env | Add serve configuration for DP/TP combo. |
| benchmark/vllm/qwen3-30B-A3B/serve-dp1-tp32.env | Add serve configuration for DP/TP combo. |
| benchmark/vllm/qwen3-30B-A3B-trn2/vllm-results.csv | Add benchmark results dataset (trn2). |
| benchmark/vllm/qwen3-30B-A3B-trn2/serve-dp1-tp4.env | Add serve configuration (trn2). |
| benchmark/vllm/qwen3-30B-A3B-2507/vllm-results.csv | Add DP scaling benchmark dataset (inf2). |
| benchmark/vllm/qwen3-30B-A3B-2507/serve-dp3-tp8.env | Add serve configuration for DP scaling. |
| benchmark/vllm/qwen3-30B-A3B-2507/serve-dp2-tp8.env | Add serve configuration for DP scaling. |
| benchmark/vllm/qwen3-30B-A3B-2507/serve-dp1-tp8.env | Add serve configuration for DP scaling. |
| benchmark/vllm/qwen3-30B-A3B-2507/README.md | Add narrative DP scaling report for Qwen3-30B-A3B on inf2.48xlarge. |
| benchmark/vllm/qwen3-235B-A22B-trn2/vllm-results.csv | Add benchmark results dataset (trn2). |
| benchmark/vllm/qwen3-235B-A22B-trn2/serve-dp1-tp64.env | Add serve configuration (trn2). |
| benchmark/vllm/llama4-Scout/vllm-results.csv | Add benchmark results dataset. |
| benchmark/vllm/llama4-Scout/serve-dp1-tp32.env | Add explicit DATA_PARALLEL_SIZE to serve config. |
| benchmark/vllm/llama4-Scout-trn2/vllm-results.csv | Add benchmark results dataset (trn2). |
| benchmark/vllm/llama4-Scout-trn2/serve-dp1-tp64.env | Add explicit DATA_PARALLEL_SIZE to serve config (trn2). |
| benchmark/vllm/llama4-Maverick-trn2/vllm-results.csv | Add benchmark results dataset (trn2). |
| benchmark/vllm/llama4-Maverick-trn2/serve-dp1-tp64.env | Add explicit DATA_PARALLEL_SIZE to serve config (trn2). |
| benchmark/vllm/llama3-70B-trn2/serve-dp2-tp32.env | Add serve configuration for DP=2 / TP=32. |
| benchmark/vllm/llama-3.1-8b/vllm-results-dp4.csv | Add DP=4 benchmark results dataset. |
| benchmark/vllm/llama-3.1-8b/vllm-results-dp3.csv | Add DP=3 benchmark results dataset. |
| benchmark/vllm/llama-3.1-8b/vllm-results-cli-dp.csv | Add CLI-based DP benchmark results dataset. |
| benchmark/vllm/llama-3.1-8b/serve-dp3-tp8.env | Add serve configuration for DP scaling. |
| benchmark/vllm/llama-3.1-8b/serve-dp2-tp8.env | Add serve configuration for DP scaling. |
| benchmark/vllm/llama-3.1-8b/serve-dp1-tp8.env | Add serve configuration for DP scaling. |
| benchmark/vllm/llama-3.1-8b/dp-benchmark-results.md | Add summarized DP scaling report for Llama-3.1-8B. |
| benchmark/vllm/llama-3.1-8b-trn2/vllm-results.csv | Add benchmark results dataset (trn2). |
| benchmark/vllm/llama-3.1-8b-trn2/serve-dp1-tp4.env | Add explicit DATA_PARALLEL_SIZE to serve config (trn2). |
| benchmark/vllm/data-parallel/README.md | Remove legacy docker-compose based DP benchmark instructions. |
| benchmark/vllm/data-parallel/qwen3-30B-A3B/nginx.conf | Remove legacy nginx LB config for docker-compose DP setup. |
| benchmark/vllm/data-parallel/qwen3-30B-A3B/docker-compose.yaml | Remove legacy docker-compose DP deployment. |
| benchmark/vllm/data-parallel/qwen3-30B-A3B/.env | Remove legacy docker-compose env file. |
| benchmark/vllm/data-parallel/llama3.1-8b/nginx-dp4.conf | Remove legacy nginx LB config (dp4). |
| benchmark/vllm/data-parallel/llama3.1-8b/nginx-dp3.conf | Remove legacy nginx LB config (dp3). |
| benchmark/vllm/data-parallel/llama3.1-8b/docker-compose-dp4.yaml | Remove legacy docker-compose DP deployment (dp4). |
| benchmark/vllm/data-parallel/llama3.1-8b/docker-compose-dp3.yaml | Remove legacy docker-compose DP deployment (dp3). |
| benchmark/vllm/data-parallel/llama3.1-8b/.env | Remove legacy docker-compose env file. |
| benchmark/vllm/data-parallel/llama3-70B-trn2/nginx.conf | Remove legacy nginx LB config for docker-compose DP setup. |
| benchmark/vllm/data-parallel/llama3-70B-trn2/docker-compose.yaml | Remove legacy docker-compose DP deployment. |
| benchmark/vllm/data-parallel/llama3-70B-trn2/.env | Remove legacy docker-compose env file. |
tengomucho approved these changes Mar 23, 2026
When `optimum-cli neuron serve` runs with DP=1, it previously did not set NEURON_RT_VISIBLE_CORES, causing NxD to claim all available NeuronCores on the instance instead of just the tensor_parallel_size cores needed. This blocked subsequent DP replicas or other workloads from using the remaining cores. Reuse VLLMServerManager._resolve_physical_cores() to parse the existing env var and restrict to the first TP cores. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
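The restriction described in this commit can be sketched roughly as follows. This is not the body of `VLLMServerManager._resolve_physical_cores()`, whose signature is not shown here. The helper names, the `total_cores` parameter, and the accepted syntax (`0-3` ranges and comma-separated lists) are assumptions made for illustration.

```python
def parse_visible_cores(spec):
    """Expand a spec such as "0-3" or "0,2,4-5" into a list of core ids."""
    cores = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cores.extend(range(int(start), int(end) + 1))
        else:
            cores.append(int(part))
    return cores


def restrict_visible_cores(env, tensor_parallel_size, total_cores):
    """Return a copy of env exposing only the first tensor_parallel_size cores.

    Hypothetical helper: if NEURON_RT_VISIBLE_CORES is unset, every core in
    range(total_cores) is treated as available.
    """
    spec = env.get("NEURON_RT_VISIBLE_CORES")
    available = parse_visible_cores(spec) if spec else list(range(total_cores))
    if len(available) < tensor_parallel_size:
        raise ValueError(
            f"need {tensor_parallel_size} cores, only {len(available)} visible"
        )
    selected = available[:tensor_parallel_size]
    new_env = dict(env)
    new_env["NEURON_RT_VISIBLE_CORES"] = ",".join(str(c) for c in selected)
    return new_env
```

Returning a new mapping instead of mutating `os.environ` keeps the restriction scoped to the process that is about to be launched.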
…e.sh Replace the split single-instance/ and data-parallel/ directory structure (with docker-compose + nginx configs) with a flat layout where each model has serve-dpX-tpY.env files sourced by a single serve.sh script that calls `optimum-cli neuron serve`. Add Llama 3.1 8B data-parallel benchmark results (DP=1/2/3 on inf2.48xlarge, TP=8). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
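To make the `serve-dpX-tpY.env` naming convention concrete, the sketch below reads the DP and TP values from a config file name and parses simple `KEY=VALUE` lines. It is written in Python for illustration only; the actual `serve.sh` is a shell script that sources these files, and both helper names are hypothetical.

```python
import re

# Matches names such as "serve-dp2-tp16.env".
ENV_NAME = re.compile(r"^serve-dp(\d+)-tp(\d+)\.env$")


def parse_config_name(filename):
    """Return (data_parallel_size, tensor_parallel_size) from the file name."""
    match = ENV_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected config name: {filename}")
    return int(match.group(1)), int(match.group(2))


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

A check such as `int(parse_env_text(text)["DATA_PARALLEL_SIZE"]) == parse_config_name(name)[0]` could then flag a file whose contents disagree with its name.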
When data_parallel_size > 1, the server manager spawns multiple vLLM processes that each independently export/compile the model. These processes were sharing /tmp/nxd_model/ as the compiler workdir and the same CWD, causing races on HLO temp files and shutil.rmtree collisions. Give each DP rank its own BASE_COMPILE_WORK_DIR and CWD. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
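A minimal sketch of the per-rank isolation described above, assuming each rank gets its own directory under a common base. The `dp_rank_<N>/compile` layout and the `rank_launch_settings` helper are hypothetical; only the `BASE_COMPILE_WORK_DIR` variable name comes from the commit message.

```python
import os
import tempfile


def rank_launch_settings(base_dir, rank, env):
    """Return (cwd, env) for one DP rank so ranks never share temp files.

    The directory layout is an assumption made for illustration.
    """
    rank_dir = os.path.join(base_dir, f"dp_rank_{rank}")
    compile_dir = os.path.join(rank_dir, "compile")
    os.makedirs(compile_dir, exist_ok=True)
    rank_env = dict(env)
    # Each rank compiles into its own workdir instead of a shared one.
    rank_env["BASE_COMPILE_WORK_DIR"] = compile_dir
    return rank_dir, rank_env


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        for rank in range(2):
            cwd, env = rank_launch_settings(base, rank, os.environ)
            print(rank, cwd, env["BASE_COMPILE_WORK_DIR"])
```

The returned pair could be passed as the `cwd=` and `env=` arguments of `subprocess.Popen`, so that one rank cleaning up its directory cannot remove files another rank is still writing.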
Qwen3-30B-A3B-Instruct-2507 benchmarked with TP=8, SL=4096 across DP1/DP2/DP3 at BS=4 and BS=32. BS=32 is the sweet spot at 124 tok/s (DP3), while BS=64 exceeds device memory. Full results in CSV. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tell vLLM each worker's DP rank so its set_device_control_env_var() correctly propagates NEURON_RT_VISIBLE_CORES to EngineCore processes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
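The flag propagation can be sketched as follows. Only `--data-parallel-size` and `--data-parallel-rank` come from this PR; the `build_rank_commands` helper and the base command in the usage example are hypothetical placeholders, not the command line the server manager actually builds.

```python
def build_rank_commands(base_cmd, data_parallel_size):
    """Return one argument list per DP rank, each carrying its size and rank."""
    commands = []
    for rank in range(data_parallel_size):
        cmd = list(base_cmd) + [
            "--data-parallel-size", str(data_parallel_size),
            "--data-parallel-rank", str(rank),
        ]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # "my-model" is a placeholder model name.
    for cmd in build_rank_commands(["vllm", "serve", "my-model"], 2):
        print(" ".join(cmd))
```

Copying `base_cmd` for each rank avoids accumulating flags from earlier ranks in a shared list.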
bd9a8a2 to
7add424
Compare
What does this PR do?
Fixes several issues with the vLLM data parallel (DP) server-manager path and updates benchmarks.
Fixes
- `serve.py`: accept `--data_parallel_size` (underscore form) to match other CLI args like `--tensor_parallel_size`
- `server_manager.py`: restrict `NEURON_RT_VISIBLE_CORES` in DP=1 path so single-worker doesn't claim all cores
- `server_manager.py`: isolate DP worker workdirs (`BASE_COMPILE_WORK_DIR`, CWD) to prevent compile races on temp files
- `server_manager.py`: pass `--data-parallel-size` and `--data-parallel-rank` to each vLLM subprocess so vLLM's `set_device_control_env_var()` correctly propagates `NEURON_RT_VISIBLE_CORES` to EngineCore processes

Benchmarks
serve.sh