feat(vision): add Vision DP for parallel ViT computation across Ulysses SP ranks #357
aoshen524 wants to merge 8 commits into alibaba:main
Conversation
…es SP ranks

Distribute whole images across Ulysses SP ranks for parallelized ViT computation, reducing ViT peak memory by ~sp_size x (e.g. SP=4 -> ~4x ViT memory reduction).

Key changes:
- Add roll/utils/context_parallel/vision_dp.py with image distribution utilities, a GatherVisionEmbeddings autograd function, and a model-agnostic VisionTransformer wrapper
- Add apply_vision_dp_patch() in monkey_patch.py for the Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3-VL-MoE VisionTransformer classes
- Integrate into the DeepSpeed strategy (both inference and training workers)
- Add 17 unit tests covering all utility functions, edge cases, and integration workflows

Ported from verl (verl-project/verl#5230).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
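The distribution utilities hinge on knowing how many patches each image contributes. A minimal sketch, assuming Qwen-VL-style `grid_thw` rows of `[t, h, w]` (the function name matches the one mentioned above, but the exact signature in vision_dp.py may differ):

```python
def get_image_patch_counts(grid_thw):
    """Per-image patch counts: each [t, h, w] row flattens to t*h*w patches."""
    return [int(t) * int(h) * int(w) for t, h, w in grid_thw]

# Two single-frame images: a 16x16 patch grid and a 32x32 patch grid.
print(get_image_patch_counts([[1, 16, 16], [1, 32, 32]]))  # [256, 1024]
```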
…issues

Address reviewer comments (same fixes as verl PR #5230 and AReaL PR #929):

1. **Gradient routing fix (critical)**: Replace `grad_scaler * dp_size` with `all_reduce(SUM)` in `GatherVisionEmbeddings.backward()` to aggregate partial sequence gradients before slicing. Fixes silent gradient loss when vision tokens span multiple sequence-shard boundaries.
2. **Load-balanced assignment**: Replace count-based chunking with greedy contiguous bin-packing that balances the total patch load across ranks.
3. **Remove unnecessary all_gather**: Pass the pre-computed `all_counts` from the caller instead of doing an all_gather in forward.
4. **Idempotency guard**: Extract a `_patch_vision_class()` helper with a `_vision_dp_patched` attribute check. Add `_unapply_vision_class()` to properly clear the flag on unapply.
5. **Remove Qwen3-VL-MoE dead code**: Remove unreachable qwen3_vl_moe blocks from apply/unapply (not yet in transformers vl_model_mappings).
6. **GPU→CPU sync optimization**: Move `grid_thw.cpu()` to the dp_vision_forward entry point to avoid repeated `.tolist()` GPU→CPU syncs.
7. **Tensor slicing**: Replace the Python loop + list append in prepare_local_vision_inputs with a contiguous tensor slice using cumsum.
8. **Test improvements**: Rename tests, add a load-balancing test, add a gather_none_group test, use parametrize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
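The gradient-routing fix in item 1 can be illustrated without torch. When an image's vision tokens span two sequence shards, each rank's incoming gradient covers only its own shard; summing across ranks (what `all_reduce(SUM)` does) recovers the full gradient, while scaling any single rank's partial gradient by dp_size does not. A toy sketch with plain lists standing in for tensors and made-up numbers:

```python
# True gradient for 4 vision tokens of one image.
full_grad = [1.0, 2.0, 3.0, 4.0]

# After sequence sharding, each rank sees grads only for its own tokens.
rank_grads = [
    [1.0, 2.0, 0.0, 0.0],  # rank 0 holds tokens 0-1
    [0.0, 0.0, 3.0, 4.0],  # rank 1 holds tokens 2-3
]
dp_size = len(rank_grads)

# Old (buggy): scale one rank's partial gradient by dp_size.
scaled = [g * dp_size for g in rank_grads[0]]
print(scaled)  # [2.0, 4.0, 0.0, 0.0] -- gradients for tokens 2-3 silently lost

# Fixed: sum across ranks (the all_reduce) before slicing.
summed = [sum(col) for col in zip(*rank_grads)]
print(summed)  # [1.0, 2.0, 3.0, 4.0] -- matches full_grad
```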
…d contiguous guard

- Trim verbose docstrings to concise one-liners
- Delete the dead store ctx.hidden_size (written in forward, never read in backward)
- Simplify hidden_size detection: self.config.out_hidden_size
- Add requires_grad_() so an empty rank participates in the backward all_reduce
- Add a .contiguous() guard before all_reduce (NCCL requirement)
- Reuse get_image_patch_counts in the spatial_merge_size == 1 path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the isinstance(tuple) check with model-attribute detection (hasattr on deepstack_merger_list). Empty ranks now create matching empty deepstack tensors and participate in the all-gather, preventing an NCCL deadlock when num_images < dp_size. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `vision_dp: bool = False` to ModelArguments and gate apply_vision_dp_patch() calls in both DeepSpeedInferStrategy and DeepSpeedTrainStrategy behind it. Vision DP is now opt-in. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace `expected_patches = end_patch - start_patch` (always true by Python slicing) with an independent cross-check via `get_image_patch_counts(local_grid_thw)` in prepare_local_vision_inputs()
- Rename tests to the `test_<what>_<condition>_<expected>()` convention
- Add missing tests: embedding_counts empty, contiguous coverage, gather same-storage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
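Because each rank owns a contiguous run of images, the cumsum slice from fix 7 (and the cross-check this commit adds) reduces to two offset lookups. A torch-free sketch; `local_patch_slice` is a hypothetical helper illustrating the idea, not the actual prepare_local_vision_inputs code:

```python
from itertools import accumulate

def local_patch_slice(patch_counts, assignment, dp_rank):
    """Return the contiguous [start, end) patch range owned by dp_rank."""
    offsets = [0] + list(accumulate(patch_counts))  # cumsum over images
    mine = [i for i, r in enumerate(assignment) if r == dp_rank]
    if not mine:
        return 0, 0  # an empty rank gets an empty slice
    return offsets[mine[0]], offsets[mine[-1] + 1]

# Three images of 4, 6, and 2 patches, assigned [rank0, rank0, rank1]:
print(local_patch_slice([4, 6, 2], [0, 0, 1], 0))  # (0, 10)
print(local_patch_slice([4, 6, 2], [0, 0, 1], 1))  # (10, 12)
```

The independent cross-check then amounts to verifying that `end - start` equals the sum of the local images' patch counts.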
Sync shared utility functions with verl's stricter error handling:

- get_image_patch_counts/get_image_embedding_counts: empty grid_thw raises ValueError instead of returning []
- assign_images_to_dp_ranks: validate dp_size > 0; empty patch_counts raises ValueError instead of returning empty lists
- prepare_local_vision_inputs: add a dp_rank bounds check, use tensor ops for offset computation (avoiding a Python-list round-trip), add an int() cast
- GatherVisionEmbeddings.forward: dp_size <= 1 raises RuntimeError, validate all_counts length, max_count == 0 raises RuntimeError
- GatherVisionEmbeddings.backward: assert dp_size > 1, add a CUDA check
- dp_vision_forward: sp_size <= 1 raises RuntimeError, use GatherVisionEmbeddings.apply() directly, add detailed assert messages
- Update tests to match: empty→raises, add dp_size/dp_rank validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
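A pure-Python sketch combining this commit's stricter validation with the greedy contiguous bin-packing from the earlier load-balancing fix. This illustrates the described behavior under assumed semantics; it is not the actual vision_dp.py implementation:

```python
def assign_images_to_dp_ranks(patch_counts, dp_size):
    """Map each image to a rank: contiguous runs, greedily balanced by patch load."""
    if dp_size <= 0:
        raise ValueError(f"dp_size must be positive, got {dp_size}")
    if not patch_counts:
        raise ValueError("patch_counts must be non-empty")
    target = sum(patch_counts) / dp_size  # ideal per-rank load
    assignment, load, rank = [], 0.0, 0
    for count in patch_counts:
        # Close the current bin once it reaches the ideal load.
        if load >= target and rank < dp_size - 1:
            rank, load = rank + 1, 0.0
        assignment.append(rank)
        load += count
    return assignment

# A skewed workload: count-based chunking would give [0, 0, 1, 1]
# (1010 vs 20 patches); load-based packing isolates the big image.
print(assign_images_to_dp_ranks([1000, 10, 10, 10], 2))  # [0, 1, 1, 1]
```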
guoshengCS
left a comment
Thanks for the good work!
    current_platform.apply_ulysses_patch()
    set_upg_manager(ulysses_size=cp_size, rank=global_rank, world_size=world_size)
    if self.worker_config.model_args.vision_dp:
        apply_vision_dp_patch()
It seems vision_dp also suits the fsdp strategy in the same way, and apply_vision_dp_patch has to be called manually since it is not included in apply_ulysses_patch. Could you please support it in fsdp_strategy too?
Call apply_vision_dp_patch() in fsdp2_strategy.py after set_upg_manager(), mirroring the existing pattern in deepspeed_strategy.py. This ensures Vision DP works correctly with FSDP2, not just DeepSpeed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aoshen524 Could you please provide training curves with and without Vision DP?
        local_embeddings = original_forward(self, local_pixels, local_grid_thw, **kwargs)
    else:
        # This rank has no images, create empty tensor with correct hidden size
        hidden_size = getattr(getattr(self, "config", None), "out_hidden_size", None)
Qwen2VL uses config.hidden_size as its vision-language hidden dimension.
We should add support for this accordingly.
https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct/blob/main/config.json
https://github.com/huggingface/transformers/blob/8cb5963cc22174954e7dca2c0a3320b7dc2f4edc/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L662
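A hedged sketch of a lookup covering both families, with attribute names taken from the linked configs; `resolve_vision_hidden_size` is a hypothetical helper for illustration, not existing code in the PR:

```python
from types import SimpleNamespace

def resolve_vision_hidden_size(config):
    """Qwen2.5-VL/Qwen3-VL expose out_hidden_size; Qwen2-VL uses hidden_size."""
    for attr in ("out_hidden_size", "hidden_size"):
        size = getattr(config, attr, None)
        if size is not None:
            return int(size)
    raise ValueError("cannot infer vision hidden size from config")

print(resolve_vision_hidden_size(SimpleNamespace(out_hidden_size=3584)))  # 3584
print(resolve_vision_hidden_size(SimpleNamespace(hidden_size=1536)))      # 1536
```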
            (0, h), dtype=hidden_states.dtype, device=hidden_states.device
        )
        for _ in range(num_deepstack)
    ]
Why doesn't the empty-rank local_deepstack path also call requires_grad_(), similar to empty local_embeddings? Since each deepstack tensor is also passed through GatherVisionEmbeddings, it seems those empty tensors should still participate in autograd so every rank enters the same backward all_reduce. Otherwise, could empty ranks skip the custom backward for deepstack and risk a collective mismatch or hang?
Vision Data Parallel: Distribute ViT computation across Ulysses SP ranks
Ported from verl PR #5230, adapted for ROLL's Ulysses SP infrastructure.
Motivation
When using Ulysses Sequence Parallelism (sp_size > 1), the VisionTransformer still processes all images on every rank, wasting memory. Vision DP distributes whole images across SP ranks, reducing ViT peak memory by roughly sp_size x.
Key changes
- roll/utils/context_parallel/vision_dp.py
- roll/utils/context_parallel/monkey_patch.py
- tests/utils/test_vision_dp_on_cpu.py

Tests
python -m pytest tests/utils/test_vision_dp_on_cpu.py -v # 28 passed