feat(vision): add Vision DP for parallel ViT computation across Ulysses SP ranks #357
aoshen524 wants to merge 8 commits into alibaba:main
Conversation
…es SP ranks

Distribute whole images across Ulysses SP ranks for parallelized ViT computation, reducing ViT peak memory by ~sp_size x (e.g. SP=4 -> ~4x ViT memory reduction).

Key changes:
- Add roll/utils/context_parallel/vision_dp.py with image distribution utilities, a GatherVisionEmbeddings autograd function, and a model-agnostic VisionTransformer wrapper
- Add apply_vision_dp_patch() in monkey_patch.py for the Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3-VL-MoE VisionTransformer classes
- Integrate into the DeepSpeed strategy (both inference and training workers)
- Add 17 unit tests covering all utility functions, edge cases, and integration workflows

Ported from verl (verl-project/verl#5230).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
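The distribution utilities hinge on knowing how many patches each image contributes. A minimal sketch, assuming Qwen-VL-style `grid_thw` rows of `[t, h, w]` (the function name matches the one mentioned above, but the exact signature in vision_dp.py may differ):

```python
def get_image_patch_counts(grid_thw):
    """Per-image patch counts: each [t, h, w] row flattens to t*h*w patches."""
    return [int(t) * int(h) * int(w) for t, h, w in grid_thw]

# Two single-frame images: a 16x16 patch grid and a 32x32 patch grid.
print(get_image_patch_counts([[1, 16, 16], [1, 32, 32]]))  # [256, 1024]
```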
…issues

Address reviewer comments (same fixes as verl PR #5230 and AReaL PR #929):

1. **Gradient routing fix (critical)**: Replace `grad_scaler * dp_size` with `all_reduce(SUM)` in `GatherVisionEmbeddings.backward()` to aggregate partial sequence gradients before slicing. Fixes silent gradient loss when vision tokens span multiple sequence-shard boundaries.
2. **Load-balanced assignment**: Replace count-based chunking with greedy contiguous bin-packing that balances the total patch load across ranks.
3. **Remove unnecessary all_gather**: Pass the pre-computed `all_counts` from the caller instead of doing an all_gather in forward.
4. **Idempotency guard**: Extract a `_patch_vision_class()` helper with a `_vision_dp_patched` attribute check. Add `_unapply_vision_class()` to properly clear the flag on unapply.
5. **Remove Qwen3-VL-MoE dead code**: Remove unreachable qwen3_vl_moe blocks from apply/unapply (not yet in transformers vl_model_mappings).
6. **GPU→CPU sync optimization**: Move `grid_thw.cpu()` to the dp_vision_forward entry point to avoid repeated `.tolist()` GPU→CPU syncs.
7. **Tensor slicing**: Replace the Python loop + list append in prepare_local_vision_inputs with a contiguous tensor slice using cumsum.
8. **Test improvements**: Rename tests, add a load-balancing test, add a gather_none_group test, use parametrize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
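The gradient-routing fix in item 1 can be illustrated without torch. When an image's vision tokens span two sequence shards, each rank's incoming gradient covers only its own shard; summing across ranks (what `all_reduce(SUM)` does) recovers the full gradient, while scaling any single rank's partial gradient by dp_size does not. A toy sketch with plain lists standing in for tensors and made-up numbers:

```python
# True gradient for 4 vision tokens of one image.
full_grad = [1.0, 2.0, 3.0, 4.0]

# After sequence sharding, each rank sees grads only for its own tokens.
rank_grads = [
    [1.0, 2.0, 0.0, 0.0],  # rank 0 holds tokens 0-1
    [0.0, 0.0, 3.0, 4.0],  # rank 1 holds tokens 2-3
]
dp_size = len(rank_grads)

# Old (buggy): scale one rank's partial gradient by dp_size.
scaled = [g * dp_size for g in rank_grads[0]]
print(scaled)  # [2.0, 4.0, 0.0, 0.0] -- gradients for tokens 2-3 silently lost

# Fixed: sum across ranks (the all_reduce) before slicing.
summed = [sum(col) for col in zip(*rank_grads)]
print(summed)  # [1.0, 2.0, 3.0, 4.0] -- matches full_grad
```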
…d contiguous guard

- Trim verbose docstrings to concise one-liners
- Delete the dead store ctx.hidden_size (written in forward, never read in backward)
- Simplify hidden_size detection: self.config.out_hidden_size
- Add requires_grad_() so an empty rank participates in the backward all_reduce
- Add a .contiguous() guard before all_reduce (NCCL requirement)
- Reuse get_image_patch_counts in the spatial_merge_size == 1 path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the isinstance(tuple) check with model-attribute detection (hasattr on deepstack_merger_list). Empty ranks now create matching empty deepstack tensors and participate in the all-gather, preventing an NCCL deadlock when num_images < dp_size. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `vision_dp: bool = False` to ModelArguments and gate apply_vision_dp_patch() calls in both DeepSpeedInferStrategy and DeepSpeedTrainStrategy behind it. Vision DP is now opt-in. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace `expected_patches = end_patch - start_patch` (always true by Python slicing) with an independent cross-check via `get_image_patch_counts(local_grid_thw)` in prepare_local_vision_inputs()
- Rename tests to the `test_<what>_<condition>_<expected>()` convention
- Add missing tests: embedding_counts empty, contiguous coverage, gather same-storage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
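Because each rank owns a contiguous run of images, the cumsum slice from fix 7 (and the cross-check this commit adds) reduces to two offset lookups. A torch-free sketch; `local_patch_slice` is a hypothetical helper illustrating the idea, not the actual prepare_local_vision_inputs code:

```python
from itertools import accumulate

def local_patch_slice(patch_counts, assignment, dp_rank):
    """Return the contiguous [start, end) patch range owned by dp_rank."""
    offsets = [0] + list(accumulate(patch_counts))  # cumsum over images
    mine = [i for i, r in enumerate(assignment) if r == dp_rank]
    if not mine:
        return 0, 0  # an empty rank gets an empty slice
    return offsets[mine[0]], offsets[mine[-1] + 1]

# Three images of 4, 6, and 2 patches, assigned [rank0, rank0, rank1]:
print(local_patch_slice([4, 6, 2], [0, 0, 1], 0))  # (0, 10)
print(local_patch_slice([4, 6, 2], [0, 0, 1], 1))  # (10, 12)
```

The independent cross-check then amounts to verifying that `end - start` equals the sum of the local images' patch counts.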
Sync shared utility functions with verl's stricter error handling:

- get_image_patch_counts/get_image_embedding_counts: empty grid_thw raises ValueError instead of returning []
- assign_images_to_dp_ranks: validate dp_size > 0; empty patch_counts raises ValueError instead of returning empty lists
- prepare_local_vision_inputs: add a dp_rank bounds check, use tensor ops for offset computation (avoiding a Python-list round-trip), add an int() cast
- GatherVisionEmbeddings.forward: dp_size <= 1 raises RuntimeError, validate all_counts length, max_count == 0 raises RuntimeError
- GatherVisionEmbeddings.backward: assert dp_size > 1, add a CUDA check
- dp_vision_forward: sp_size <= 1 raises RuntimeError, use GatherVisionEmbeddings.apply() directly, add detailed assert messages
- Update tests to match: empty→raises, add dp_size/dp_rank validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
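A pure-Python sketch combining this commit's stricter validation with the greedy contiguous bin-packing from the earlier load-balancing fix. This illustrates the described behavior under assumed semantics; it is not the actual vision_dp.py implementation:

```python
def assign_images_to_dp_ranks(patch_counts, dp_size):
    """Map each image to a rank: contiguous runs, greedily balanced by patch load."""
    if dp_size <= 0:
        raise ValueError(f"dp_size must be positive, got {dp_size}")
    if not patch_counts:
        raise ValueError("patch_counts must be non-empty")
    target = sum(patch_counts) / dp_size  # ideal per-rank load
    assignment, load, rank = [], 0.0, 0
    for count in patch_counts:
        # Close the current bin once it reaches the ideal load.
        if load >= target and rank < dp_size - 1:
            rank, load = rank + 1, 0.0
        assignment.append(rank)
        load += count
    return assignment

# A skewed workload: count-based chunking would give [0, 0, 1, 1]
# (1010 vs 20 patches); load-based packing isolates the big image.
print(assign_images_to_dp_ranks([1000, 10, 10, 10], 2))  # [0, 1, 1, 1]
```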
guoshengCS
left a comment
Thanks for the good work!
    current_platform.apply_ulysses_patch()
    set_upg_manager(ulysses_size=cp_size, rank=global_rank, world_size=world_size)
    if self.worker_config.model_args.vision_dp:
        apply_vision_dp_patch()
It seems vision_dp also suits the fsdp strategy in the same way, and apply_vision_dp_patch has to be called manually since it is not included in apply_ulysses_patch. Could you please support it in fsdp_strategy too?
Call apply_vision_dp_patch() in fsdp2_strategy.py after set_upg_manager(), mirroring the existing pattern in deepspeed_strategy.py. This ensures Vision DP works correctly with FSDP2, not just DeepSpeed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aoshen524 Could you please provide training curves with and without Vision DP?
        local_embeddings = original_forward(self, local_pixels, local_grid_thw, **kwargs)
    else:
        # This rank has no images, create empty tensor with correct hidden size
        hidden_size = getattr(getattr(self, "config", None), "out_hidden_size", None)
Qwen2VL uses config.hidden_size as its vision-language hidden dimension.
We should add support for this accordingly.
https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct/blob/main/config.json
https://github.com/huggingface/transformers/blob/8cb5963cc22174954e7dca2c0a3320b7dc2f4edc/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L662
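A hedged sketch of a lookup covering both families, with attribute names taken from the linked configs; `resolve_vision_hidden_size` is a hypothetical helper for illustration, not existing code in the PR:

```python
from types import SimpleNamespace

def resolve_vision_hidden_size(config):
    """Qwen2.5-VL/Qwen3-VL expose out_hidden_size; Qwen2-VL uses hidden_size."""
    for attr in ("out_hidden_size", "hidden_size"):
        size = getattr(config, attr, None)
        if size is not None:
            return int(size)
    raise ValueError("cannot infer vision hidden size from config")

print(resolve_vision_hidden_size(SimpleNamespace(out_hidden_size=3584)))  # 3584
print(resolve_vision_hidden_size(SimpleNamespace(hidden_size=1536)))      # 1536
```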
            (0, h), dtype=hidden_states.dtype, device=hidden_states.device
        )
        for _ in range(num_deepstack)
    ]
Why doesn't the empty-rank local_deepstack path also call requires_grad_(), similar to empty local_embeddings? Since each deepstack tensor is also passed through GatherVisionEmbeddings, it seems those empty tensors should still participate in autograd so every rank enters the same backward all_reduce. Otherwise, could empty ranks skip the custom backward for deepstack and risk a collective mismatch or hang?
Vision Data Parallel: Distribute ViT computation across Ulysses SP ranks
Ported from verl PR #5230, adapted for ROLL's Ulysses SP infrastructure.
Motivation
When using Ulysses Sequence Parallelism (sp_size > 1), the VisionTransformer still processes all images on every rank, wasting memory. Vision DP distributes whole images across SP ranks, reducing ViT peak memory by roughly sp_size x.
Key changes
- roll/utils/context_parallel/vision_dp.py
- roll/utils/context_parallel/monkey_patch.py
- tests/utils/test_vision_dp_on_cpu.py

Tests
python -m pytest tests/utils/test_vision_dp_on_cpu.py -v # 28 passed