feat: Fix vLLM placement group conflicts in Ray clusters and add local mode… by ffrujeri · Pull Request #669 · NVIDIA-NeMo/Gym

ffrujeri · 2026-02-11T01:14:00Z

What does this PR do?

Fixes vLLM placement group conflicts in Ray clusters and adds support for local model paths.

This PR patches vLLM's v1 engine to handle multiple Ray placement groups, preventing crashes when running multiple vLLM instances in the same cluster. It also adds support for using locally stored models instead of always downloading from HuggingFace.
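The local-path behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `download_model()` code: `resolve_model_path` and its `download` callback are hypothetical names, with the callback standing in for whatever performs the HuggingFace download.

```python
import os
from typing import Callable


def resolve_model_path(model: str, download: Callable[[str], str]) -> str:
    """Return a usable path for `model`, downloading only when needed.

    `download` is a hypothetical callback that takes a hub repo id and
    returns the local directory it was downloaded to.
    """
    expanded = os.path.expanduser(model)
    if os.path.isdir(expanded):
        # The string already points at a local checkpoint directory,
        # so the download step is skipped entirely.
        return expanded
    # Otherwise treat the string as a hub repo id and delegate.
    return download(model)
```

Passing the downloader as a callback keeps the check itself free of network access, so it can be exercised without contacting the hub.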

Issues

Fixes #148

Changes in this PR

vLLM Placement Group Patch: _patch_vllm_placement_group_filter()
- Filters out placement group node resource keys (e.g., node:IP_group_N_hash)
- Allows multiple vLLM instances to coexist in the same Ray cluster
- Adds unique timestamp-based suffixes to avoid placement group name conflicts
- Includes comprehensive logging for debugging placement group creation

Local Model Support: Modified download_model()
- Checks if model path exists locally before attempting HuggingFace download
- Skips download step for local models, improving startup time

Debugging Infrastructure: _cleanup_stale_placement_groups()
- Logs existing placement groups when debug=True
- Helps diagnose placement group conflicts and resource allocation issues

Import Updates
- Added list_placement_groups from ray.util.state for placement group inspection
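To make the placement group filtering and naming items above more concrete, here is a small sketch of the idea under stated assumptions. It is not the body of `_patch_vllm_placement_group_filter()`; the helper names `filter_node_ip_keys` and `unique_pg_name` are hypothetical, and the `_group_` substring test is inferred only from the `node:IP_group_N_hash` key shape mentioned in the list.

```python
import time


def filter_node_ip_keys(resources: dict[str, float]) -> list[str]:
    """Keep only plain per-node keys such as 'node:10.0.0.1'.

    Assumption: keys created for placement group bundles contain
    '_group_' (the 'node:IP_group_N_hash' shape), so they can be
    dropped with a substring check.
    """
    keys = []
    for key in resources:
        if not key.startswith("node:"):
            continue  # CPU, GPU, memory, custom resources
        if key == "node:__internal_head__":
            continue  # Ray's head-node marker, not an IP key
        if "_group_" in key:
            continue  # placement-group-scoped copy of a node key
        keys.append(key)
    return keys


def unique_pg_name(prefix: str) -> str:
    """Append a nanosecond timestamp so repeated launches get distinct names."""
    return f"{prefix}_{time.time_ns()}"
```

With the filter in place, a node that already hosts another instance's placement group still yields a single IP key. A timestamp suffix makes name clashes unlikely but does not strictly guarantee uniqueness across processes that start at the same instant.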

…l support. Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

copy-pr-bot · 2026-02-11T01:14:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Fix vLLM placement group conflicts in Ray clusters and add local mode…

23b39e8

…l support. Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Add diff of the patch comparing to the original version.

b5cb812

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

bxyu-nvidia approved these changes Feb 11, 2026

ffrujeri changed the title ~~Fix vLLM placement group conflicts in Ray clusters and add local mode…~~ feat: Fix vLLM placement group conflicts in Ray clusters and add local mode… Feb 12, 2026

ffrujeri wants to merge 2 commits into main from ffrujeri/local-vllm-model-placement-groups-fix
