
[bugfix] fix hang issue when fed empty batch#342

Open
gameofdimension wants to merge 1 commit into NVIDIA:main from gameofdimension:patch-3

Conversation

@gameofdimension
Contributor

fix issue #341

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Removed handling for empty input in embedding processing.
@greptile-apps
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR fixes a hang issue that occurred when an empty batch was fed during distributed training. The fix removes an early-return path in _dedup_indices() that was added to handle empty input but was actually causing NCCL collective deadlocks.

Key changes:

  • Removes the special-case empty-batch branch inside ShardedDynamicEmbeddingCollection._dedup_indices()
  • All three underlying CUDA kernels (expand_table_ids_cuda, segmented_unique_cuda, compute_dedup_lengths_cuda) already contain their own empty-input guards at the C++ level, so correctness is preserved
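The guard pattern the summary attributes to the kernels can be sketched in plain Python (a stand-in for the CUDA implementation, not the actual code): the empty case is handled *inside* the routine, returning results with the same shapes and types as the normal path, so the caller never needs a special case.

```python
def segmented_unique(keys, segment_offsets):
    """Pure-Python stand-in for segmented_unique_cuda: dedup keys within
    each segment and return (unique_keys, reverse_indices).
    Empty input is handled internally, so callers need no special case."""
    if not keys:                      # internal empty-input guard
        return [], []                 # same structure as the normal path
    unique_keys, reverse = [], []
    for seg in range(len(segment_offsets) - 1):
        seen = {}                     # dedup scope is per segment
        for i in range(segment_offsets[seg], segment_offsets[seg + 1]):
            k = keys[i]
            if k not in seen:
                seen[k] = len(unique_keys)
                unique_keys.append(k)
            reverse.append(seen[k])
    return unique_keys, reverse

# Both empty and non-empty inputs flow through the same call site:
print(segmented_unique([], [0]))                     # ([], [])
print(segmented_unique([5, 5, 7, 7, 5], [0, 3, 5]))  # ([5, 7, 7, 5], [0, 0, 1, 2, 3])
```

Because the guard lives inside the routine, every rank executes the identical call sequence regardless of batch size.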

Why the old code was wrong: In a data-parallel/model-parallel setup, individual ranks can receive empty batches while others do not. The old early-return path made execution asymmetric relative to the NCCL collectives in input_dist, which deadlocked. Removing it lets every rank execute the same code path through the CUDA kernels (which handle empty inputs gracefully internally), so all collective operations remain symmetric.
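The deadlock mechanism can be modeled without NCCL at all. In this toy sketch (threads stand in for ranks, a `threading.Barrier` stands in for a collective), a rank that early-returns on an empty batch never reaches the "collective", and the other ranks block until the barrier times out; real NCCL collectives would hang indefinitely instead.

```python
import threading

def run_ranks(batches, skip_on_empty, timeout=0.5):
    """Each 'rank' processes its batch; the barrier plays the role of a
    collective that every rank must enter."""
    barrier = threading.Barrier(len(batches), timeout=timeout)
    results = {}

    def rank_fn(rank, batch):
        if skip_on_empty and len(batch) == 0:
            results[rank] = "early-return"   # skipped the collective
            return
        try:
            barrier.wait()                   # the "collective"
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            results[rank] = "deadlock"       # peers never arrived

    threads = [threading.Thread(target=rank_fn, args=(r, b))
               for r, b in enumerate(batches)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

# Rank 1 gets an empty batch. With the old early return, rank 0 stalls:
print(run_ranks([[1, 2], []], skip_on_empty=True))
# With the early return removed, all ranks reach the collective:
print(run_ranks([[1, 2], []], skip_on_empty=False))
```

The `skip_on_empty=True` case mirrors the removed branch; `False` mirrors the unified path after this PR.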

Additional fix: The old empty path was creating reverse_indices with dtype=torch.uint64, while the normal non-empty path produces int64 from segmented_unique_cuda. Removing the branch also eliminates this silent dtype inconsistency.
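Why a dtype that differs only on the empty path is a latent bug can be shown with a toy model built on the stdlib `array` module (standing in for tensor dtypes; this is not the actual torch code): downstream operations that require matching types work fine until an empty batch happens to occur.

```python
from array import array

def reverse_indices(keys, empty_as_unsigned):
    """Toy stand-in: the old empty branch produced an unsigned 64-bit
    result ('Q', like torch.uint64) while the normal path produced a
    signed one ('q', like torch.int64)."""
    if not keys and empty_as_unsigned:
        return array('Q')                    # old special case: odd dtype out
    return array('q', range(len(keys)))      # unified path: consistent dtype

batches = [[10, 11], []]

out = array('q')
for b in batches:
    out += reverse_indices(b, empty_as_unsigned=False)  # always succeeds

# With the old branch, concatenation fails exactly when a batch is empty:
try:
    out2 = array('q')
    for b in batches:
        out2 += reverse_indices(b, empty_as_unsigned=True)
except TypeError as e:
    print("dtype mismatch on empty batch:", type(e).__name__)
```

The failure only triggers on the empty-batch input, which is what makes such inconsistencies "silent" in normal testing.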

Confidence Score: 5/5

This PR is safe to merge — it removes a faulty early-return path and the underlying CUDA kernels already handle empty inputs correctly.

The change is minimal (15 lines removed, 0 added). All three CUDA kernels have verified empty-input guards at the C++ level. The fix correctly unifies the code path across all ranks in distributed training, eliminating the NCCL hang. No new logic is introduced.

No files require special attention. The single changed file has a straightforward deletion and no new logic.

Important Files Changed

| Filename | Overview |
| --- | --- |
| corelib/dynamicemb/dynamicemb/shard/embedding.py | Removes the premature empty-batch short-circuit in `_dedup_indices()`; all downstream CUDA kernels already handle empty tensors safely, eliminating asymmetric collective execution in distributed training. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_dedup_indices called] --> B[iterate input_feature_splits]
    B --> C[compute num_elements]
    C --> OLD_CHECK
    subgraph OLD [Old Code - Removed]
        OLD_CHECK{num_elements == 0?}
        OLD_CHECK -->|yes| OLD_EARLY[Build KJT with lengths and offsets\nAppend empty uint64 reverse_idx\nSkip CUDA kernels\nCauses NCCL hang in dist training]
        OLD_CHECK -->|no| OLD_NORMAL[Normal CUDA path]
    end
    C --> NEW_PATH
    subgraph NEW [New Code - Current]
        NEW_PATH[expand_table_ids_cuda\nhandles empty internally]
        NEW_PATH --> NEW_SEG[segmented_unique_cuda\nhandles empty internally]
        NEW_SEG --> NEW_LEN[compute_dedup_lengths_cuda\nhandles empty internally]
        NEW_LEN --> NEW_KJT[Build KJT from unique_keys\nAppend int64 reverse_idx]
    end
    NEW_KJT --> DONE[All ranks follow same path\nNCCL collectives succeed]
```

Reviews (1): Last reviewed commit: "Remove empty input handling in embedding..."

@gameofdimension
Contributor Author

@shijieliu please take a look

@shijieliu
Collaborator

hi @gameofdimension, thanks for your contribution! We will try to reproduce this issue first and merge after verifying the fix.
