[bugfix] fix hang issue when fed empty batch#342
[bugfix] fix hang issue when fed empty batch#342gameofdimension wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
Removed handling for empty input in embedding processing.
Greptile SummaryThis PR fixes a hang issue that occurred when an empty batch was fed during distributed training. The fix removes an early-return path in Key changes:
Why the old code was wrong: In a data-parallel/model-parallel setup, individual ranks can receive empty batches while others do not. The old early-return path caused asymmetric execution relative to the NCCL collectives in Additional fix: The old empty path was creating Confidence Score: 5/5This PR is safe to merge — it removes a faulty early-return path and the underlying CUDA kernels already handle empty inputs correctly. The change is minimal (15 lines removed, 0 added). All three CUDA kernels have verified empty-input guards at the C++ level. The fix correctly unifies the code path across all ranks in distributed training, eliminating the NCCL hang. No new logic is introduced. No files require special attention. The single changed file has a straightforward deletion and no new logic. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[_dedup_indices called] --> B[iterate input_feature_splits]
B --> C[compute num_elements]
C --> OLD_CHECK
subgraph OLD [Old Code - Removed]
OLD_CHECK{num_elements == 0?}
OLD_CHECK -->|yes| OLD_EARLY[Build KJT with lengths and offsets\nAppend empty uint64 reverse_idx\nSkip CUDA kernels\nCauses NCCL hang in dist training]
OLD_CHECK -->|no| OLD_NORMAL[Normal CUDA path]
end
C --> NEW_PATH
subgraph NEW [New Code - Current]
NEW_PATH[expand_table_ids_cuda\nhandles empty internally]
NEW_PATH --> NEW_SEG[segmented_unique_cuda\nhandles empty internally]
NEW_SEG --> NEW_LEN[compute_dedup_lengths_cuda\nhandles empty internally]
NEW_LEN --> NEW_KJT[Build KJT from unique_keys\nAppend int64 reverse_idx]
end
NEW_KJT --> DONE[All ranks follow same path\nNCCL collectives succeed]
Reviews (1): Last reviewed commit: "Remove empty input handling in embedding..." | Re-trigger Greptile |
|
@shijieliu please take a look |
|
hi @gameofdimension thanks for your contribution! We will try to reproduce this issue first and merge after verify. |
fix issue #341
Description
Checklist