
[bugfix] fix hang issue when fed empty batch#342

Open
gameofdimension wants to merge 1 commit into NVIDIA:main from gameofdimension:patch-3

Conversation

@gameofdimension
Contributor

fix issue #341

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Removed handling for empty input in embedding processing.
@greptile-apps
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR fixes a hang issue that occurred when an empty batch was fed during distributed training. The fix removes an early-return path in _dedup_indices() that was added to handle empty input but was actually causing NCCL collective deadlocks.

Key changes:

  • Removes the special-case empty-batch branch inside ShardedDynamicEmbeddingCollection._dedup_indices()
  • All three underlying CUDA kernels (expand_table_ids_cuda, segmented_unique_cuda, compute_dedup_lengths_cuda) already contain their own empty-input guards at the C++ level, so correctness is preserved
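The guard pattern the summary attributes to the kernels can be sketched in plain Python (a stand-in for the CUDA implementation, not the actual code): the empty case is handled *inside* the routine, returning results with the same shapes and types as the normal path, so the caller never needs a special case.

```python
def segmented_unique(keys, segment_offsets):
    """Pure-Python stand-in for segmented_unique_cuda: dedup keys within
    each segment and return (unique_keys, reverse_indices).
    Empty input is handled internally, so callers need no special case."""
    if not keys:                      # internal empty-input guard
        return [], []                 # same structure as the normal path
    unique_keys, reverse = [], []
    for seg in range(len(segment_offsets) - 1):
        seen = {}                     # dedup scope is per segment
        for i in range(segment_offsets[seg], segment_offsets[seg + 1]):
            k = keys[i]
            if k not in seen:
                seen[k] = len(unique_keys)
                unique_keys.append(k)
            reverse.append(seen[k])
    return unique_keys, reverse

# Both empty and non-empty inputs flow through the same call site:
print(segmented_unique([], [0]))                     # ([], [])
print(segmented_unique([5, 5, 7, 7, 5], [0, 3, 5]))  # ([5, 7, 7, 5], [0, 0, 1, 2, 3])
```

Because the guard lives inside the routine, every rank executes the identical call sequence regardless of batch size.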

Why the old code was wrong: In a data-parallel/model-parallel setup, individual ranks can receive empty batches while others do not. The old early-return path made execution asymmetric relative to the NCCL collectives in input_dist, which deadlocked. Removing it lets every rank execute the same code path through the CUDA kernels (which handle empty inputs gracefully internally), so all collective operations remain symmetric.
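The deadlock mechanism can be modeled without NCCL at all. In this toy sketch (threads stand in for ranks, a `threading.Barrier` stands in for a collective), a rank that early-returns on an empty batch never reaches the "collective", and the other ranks block until the barrier times out; real NCCL collectives would hang indefinitely instead.

```python
import threading

def run_ranks(batches, skip_on_empty, timeout=0.5):
    """Each 'rank' processes its batch; the barrier plays the role of a
    collective that every rank must enter."""
    barrier = threading.Barrier(len(batches), timeout=timeout)
    results = {}

    def rank_fn(rank, batch):
        if skip_on_empty and len(batch) == 0:
            results[rank] = "early-return"   # skipped the collective
            return
        try:
            barrier.wait()                   # the "collective"
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            results[rank] = "deadlock"       # peers never arrived

    threads = [threading.Thread(target=rank_fn, args=(r, b))
               for r, b in enumerate(batches)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

# Rank 1 gets an empty batch. With the old early return, rank 0 stalls:
print(run_ranks([[1, 2], []], skip_on_empty=True))
# With the early return removed, all ranks reach the collective:
print(run_ranks([[1, 2], []], skip_on_empty=False))
```

The `skip_on_empty=True` case mirrors the removed branch; `False` mirrors the unified path after this PR.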

Additional fix: The old empty path was creating reverse_indices with dtype=torch.uint64, while the normal non-empty path produces int64 from segmented_unique_cuda. Removing the branch also eliminates this silent dtype inconsistency.
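Why a dtype that differs only on the empty path is a latent bug can be shown with a toy model built on the stdlib `array` module (standing in for tensor dtypes; this is not the actual torch code): downstream operations that require matching types work fine until an empty batch happens to occur.

```python
from array import array

def reverse_indices(keys, empty_as_unsigned):
    """Toy stand-in: the old empty branch produced an unsigned 64-bit
    result ('Q', like torch.uint64) while the normal path produced a
    signed one ('q', like torch.int64)."""
    if not keys and empty_as_unsigned:
        return array('Q')                    # old special case: odd dtype out
    return array('q', range(len(keys)))      # unified path: consistent dtype

batches = [[10, 11], []]

out = array('q')
for b in batches:
    out += reverse_indices(b, empty_as_unsigned=False)  # always succeeds

# With the old branch, concatenation fails exactly when a batch is empty:
try:
    out2 = array('q')
    for b in batches:
        out2 += reverse_indices(b, empty_as_unsigned=True)
except TypeError as e:
    print("dtype mismatch on empty batch:", type(e).__name__)
```

The failure only triggers on the empty-batch input, which is what makes such inconsistencies "silent" in normal testing.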

Confidence Score: 5/5

This PR is safe to merge — it removes a faulty early-return path and the underlying CUDA kernels already handle empty inputs correctly.

The change is minimal (15 lines removed, 0 added). All three CUDA kernels have verified empty-input guards at the C++ level. The fix correctly unifies the code path across all ranks in distributed training, eliminating the NCCL hang. No new logic is introduced.

No files require special attention. The single changed file has a straightforward deletion and no new logic.

Important Files Changed

| Filename | Overview |
| --- | --- |
| corelib/dynamicemb/dynamicemb/shard/embedding.py | Removes the premature empty-batch short-circuit in `_dedup_indices()`; all downstream CUDA kernels already handle empty tensors safely, eliminating asymmetric collective execution in distributed training. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_dedup_indices called] --> B[iterate input_feature_splits]
    B --> C[compute num_elements]
    C --> OLD_CHECK
    subgraph OLD [Old Code - Removed]
        OLD_CHECK{num_elements == 0?}
        OLD_CHECK -->|yes| OLD_EARLY[Build KJT with lengths and offsets\nAppend empty uint64 reverse_idx\nSkip CUDA kernels\nCauses NCCL hang in dist training]
        OLD_CHECK -->|no| OLD_NORMAL[Normal CUDA path]
    end
    C --> NEW_PATH
    subgraph NEW [New Code - Current]
        NEW_PATH[expand_table_ids_cuda\nhandles empty internally]
        NEW_PATH --> NEW_SEG[segmented_unique_cuda\nhandles empty internally]
        NEW_SEG --> NEW_LEN[compute_dedup_lengths_cuda\nhandles empty internally]
        NEW_LEN --> NEW_KJT[Build KJT from unique_keys\nAppend int64 reverse_idx]
    end
    NEW_KJT --> DONE[All ranks follow same path\nNCCL collectives succeed]
```

Reviews (1): Last reviewed commit: "Remove empty input handling in embedding..."

@gameofdimension
Contributor Author

@shijieliu please take a look

@shijieliu
Collaborator

hi @gameofdimension, thanks for your contribution! We will try to reproduce this issue first and merge after verifying the fix.
