[Draft] Feat: dynamicemb table fusion #313
Conversation
Greptile Summary

This PR introduces a table-fusion refactor of the DynamicEmbedding library, consolidating what was previously a per-table object model into a single fused model.
Two P1 bugs and one P2 issue were identified.

Confidence Score: 3/5 — not safe to merge as-is: the two P1 bugs can cause silent data corruption (stale prefetch states after flush; CUSTOMIZED score guard bypassed).

1. (P1) `flush()` does not clear `_prefetch_states`, meaning slot indices produced before a cache flush will be consumed after it — silently returning wrong embeddings.
2. (P1) `_create_score` now pre-populates `self._scores[name] = 0` for CUSTOMIZED tables, so the `RuntimeError` guard that required users to call `set_score()` first is permanently bypassed.
3. (P2) An external-storage selection issue, plus an assertion-vs-ValueError nit, round out the findings.

The overall fusion architecture is sound and test coverage is good, but the two P1 correctness regressions need to be resolved before merging. Key files: `batched_dynamicemb_tables.py` (`flush` and `_create_score`); `key_value_table.py` (`external_storage` selection).

Important Files Changed
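The first P1 fix could look like the following — a minimal pure-Python sketch with invented names (`PrefetchStateQueue`), not the library's actual API: a flush must invalidate any queued prefetch state, since its slot indices point into the pre-flush cache.

```python
from collections import deque


class PrefetchStateQueue:
    """Toy model of the prefetch-state deque described in the summary.

    flush() must drop queued states: slot indices computed before a
    cache flush would otherwise be consumed after it and resolve to
    wrong embeddings.
    """

    def __init__(self):
        self._prefetch_states = deque()

    def append(self, state):
        self._prefetch_states.append(state)

    def pop(self):
        # FIFO: forward() consumes the oldest prefetch first.
        return self._prefetch_states.popleft()

    def flush(self):
        # Sketch of the P1 fix: invalidate stale states along with the cache.
        self._prefetch_states.clear()


q = PrefetchStateQueue()
q.append("state-0")
q.append("state-1")
q.flush()
print(len(q._prefetch_states))  # 0: no stale state survives a flush
```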
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant BDET as BatchedDynamicEmbTablesV2
    participant PrefetchQ as _prefetch_states (deque)
    participant PF as dynamicemb_prefetch
    participant FWD as DynamicEmbeddingFunction
    participant Cache as DynamicEmbCache
    participant Storage as DynamicEmbStorage
    User->>BDET: prefetch(indices, offsets)
    BDET->>Cache: set_score(fused_score), training=True
    BDET->>Storage: set_score(fused_score), training=True
    BDET->>PF: dynamicemb_prefetch(...)
    PF->>Storage: segmented_unique → unique_keys, table_ids
    PF->>Cache: lookup(unique_keys, table_ids)
    Cache-->>PF: founds, slot_indices
    PF->>Storage: find(miss_keys) [cache miss path]
    PF->>Cache: insert_and_evict(miss_keys)
    Cache-->>PF: new slot_indices, evicted_keys
    PF->>Storage: insert(evicted_keys, evicted_values)
    PF-->>BDET: PrefetchState(unique_keys, slot_indices, ...)
    BDET->>PrefetchQ: append(PrefetchState)
    User->>BDET: forward(indices, offsets)
    BDET->>PrefetchQ: popleft() → PrefetchState
    BDET->>FWD: apply(prefetch_state, ...)
    FWD->>Cache: load_from_flat(slot_indices) [CACHE mode]
    FWD-->>BDET: output_embs
    Note over FWD: outstanding_keys_ref -= num_prefetched_keys
    User->>BDET: backward(grads)
    FWD->>FWD: reduce_grads → unique_grads
    FWD->>Storage: fused_update_for_flat_table(unique_grads, update_slot_indices)
    FWD->>Cache: decrement_counter(update_slot_indices)
```
Force-pushed from 2c3371a to a36d744 (Compare)
```python
training,
EvictStrategy(evict_strategy.value) if evict_strategy else None,
lfu_accumulated_frequency_per_table,
```

```python
if prefetch_state is None:
```
What will happen when prefetch depth > 1 — is there any protection for the prefetch_state?
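One possible protection — a hedged sketch under assumed names, not the PR's actual mechanism — is to bound the queue to the configured prefetch depth and fail fast when `forward()` runs without a matching prefetch:

```python
from collections import deque

PREFETCH_DEPTH = 2  # assumed configuration value

states = deque()


def enqueue(state):
    # Refuse to queue more states than the configured depth.
    if len(states) >= PREFETCH_DEPTH:
        raise RuntimeError(
            f"prefetch depth exceeded: {len(states)} states already queued"
        )
    states.append(state)


def dequeue():
    # Refuse a forward pass that has no prefetched state to consume.
    if not states:
        raise RuntimeError("forward() called with no prefetched state")
    return states.popleft()


enqueue("a")
enqueue("b")
```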
```python
host_options: List[DynamicEmbTableOptions],
optimizer: BaseDynamicEmbeddingOptimizer,
):
    self._hbm = create_table_state(hbm_options, optimizer)
```
Will there be two duplicated table states in the hybrid case?
Force-pushed from a36d744 to c7590e7 (Compare)
Additional Comments (5)
Three consecutive statements are bare expressions whose values are immediately discarded:

```python
emb_dtype = storage.embedding_dtype()
storage.max_embedding_dim()  # return value is thrown away
cache is not None  # boolean is thrown away
```

These look like forgotten assignments; based on how the rest of the function uses these values, they should likely be bound to names. Even if the statements are intentional, the dead expressions should be removed.
When `flagged_compact(admit_mask, [missing_keys, missing_indices, missing_table_ids, None])` is called, a `None` entry is passed where a tensor is expected. This path is reachable any time `missing_scores` is `None`.

The fix is to build the list conditionally:

```python
tensors_to_compact = [missing_keys, missing_indices, missing_table_ids]
if missing_scores is not None:
    tensors_to_compact.append(missing_scores)
_, _, compacted = flagged_compact(admit_mask, tensors_to_compact)
if missing_scores is not None:
    keys_to_insert, positions_in_unique, table_ids_to_insert, scores_to_insert = compacted
else:
    keys_to_insert, positions_in_unique, table_ids_to_insert = compacted
    scores_to_insert = None
```

Simply remove line 1714.
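The guarded pattern can be exercised with a toy `flagged_compact` that keeps only flagged positions — names and semantics are assumed from the snippet above, not taken from the real kernel:

```python
def flagged_compact(mask, tensors):
    # Toy stand-in for the real kernel: keep entries where mask is True.
    compacted = [[x for x, m in zip(t, mask) if m] for t in tensors]
    return sum(mask), mask, compacted


admit_mask = [True, False, True]
missing_keys = [10, 11, 12]
missing_indices = [0, 1, 2]
missing_table_ids = [0, 0, 1]
missing_scores = None  # some score strategies legitimately omit scores

# Guarded call: scores join the list only when present.
tensors_to_compact = [missing_keys, missing_indices, missing_table_ids]
if missing_scores is not None:
    tensors_to_compact.append(missing_scores)
_, _, compacted = flagged_compact(admit_mask, tensors_to_compact)
if missing_scores is not None:
    keys_to_insert, positions_in_unique, table_ids_to_insert, scores_to_insert = compacted
else:
    keys_to_insert, positions_in_unique, table_ids_to_insert = compacted
    scores_to_insert = None
```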
The concrete implementations return 8 tuple elements, but the abstract signature declares only 7 — `missing_table_ids` is absent. Corrected 8-element annotation for `Storage.find`:

```python
) -> Tuple[
    int,
    torch.Tensor,            # missing_keys
    torch.Tensor,            # missing_indices
    torch.Tensor,            # missing_table_ids  ← add this
    Optional[torch.Tensor],  # missing_scores
    torch.Tensor,            # founds
    torch.Tensor,            # output_scores
    torch.Tensor,            # values
]:
```
Force-pushed from a4097d9 to 4bc15aa (Compare)
Additional Comments (5)
The call `flagged_compact(admit_mask, [missing_keys, missing_indices, missing_table_ids, missing_scores])` passes `missing_scores` unconditionally, even though it may be `None`. The fix is to guard the compact call and only include scores when they are available:

```python
tensors_to_compact = [missing_keys, missing_indices, missing_table_ids]
if missing_scores is not None:
    tensors_to_compact.append(missing_scores)
_, _, compacted = flagged_compact(admit_mask, tensors_to_compact)
if missing_scores is not None:
    keys_to_insert, positions_in_unique, table_ids_to_insert, scores_to_insert = compacted
else:
    keys_to_insert, positions_in_unique, table_ids_to_insert = compacted
    scores_to_insert = None
```
The original computed:

```python
new_score = cur_score + self.num_prefetch_ahead - 1
```

The replacement no longer matches this arithmetic.
The line `int64_t table_id = NumRegions == 0 ? scalar_table_id : table_ids[emb_id];` selects the branch at runtime even though `NumRegions` is a compile-time parameter. With `if constexpr`, the dead branch (and its `table_ids` load) is guaranteed not to be compiled in:

```cpp
int64_t table_id;
if constexpr (NumRegions == 0) {
  table_id = scalar_table_id;
} else {
  table_id = table_ids[emb_id];
}
```

The same pattern appears in other kernels in this file.
Additional Comments (7)
The early-return path on lines 184–192 only bypasses this in part. An alternative is to substitute an empty placeholder tensor so the compact call always receives four inputs:

```python
scores_input = (
    missing_scores
    if missing_scores is not None
    else torch.empty(0, dtype=torch.int64, device=missing_keys.device)
)
(
    _,
    _,
    (
        keys_to_insert,
        positions_in_unique,
        table_ids_to_insert,
        scores_to_insert,
    ),
) = flagged_compact(
    admit_mask,
    [missing_keys, missing_indices, missing_table_ids, scores_input],
)
scores_to_insert = scores_to_insert if missing_scores is not None else None
```
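The empty-placeholder variant can be modeled without torch — a toy sketch under the same assumed semantics, showing that the tuple shape stays fixed at four entries while the scores result is discarded when no scores were supplied:

```python
def flagged_compact(mask, tensors):
    # Toy stand-in: keep entries of each list where mask is True.
    compacted = [[x for x, m in zip(t, mask) if m] for t in tensors]
    return sum(mask), mask, compacted


admit_mask = [True, True, False]
missing_keys = [7, 8, 9]
missing_indices = [0, 1, 2]
missing_table_ids = [1, 1, 0]
missing_scores = None

# Empty placeholder keeps the compacted tuple at four entries.
scores_input = missing_scores if missing_scores is not None else []
_, _, (keys, pos, tids, scores) = flagged_compact(
    admit_mask, [missing_keys, missing_indices, missing_table_ids, scores_input]
)
scores = scores if missing_scores is not None else None
```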
The old per-table path used a different approach. This issue also appears in other call sites.
Consider adding validation:

```python
if len({kv.bucket_capacity for kv in kv_counters}) > 1:
    raise ValueError(
        "All KVCounter configs must share the same bucket_capacity in fused mode."
    )
if len({kv.key_type for kv in kv_counters}) > 1:
    raise ValueError(
        "All KVCounter configs must share the same key_type in fused mode."
    )
```
When creating external PS storage in caching mode, the `external_storage` class is taken from a single table's options. Add validation to ensure consistency:

```python
ext_storages = {opt.external_storage for opt in storage_options}
if len(ext_storages) > 1:
    raise ValueError(
        "All tables must share the same external_storage class in fused caching mode."
    )
PS = storage_options[0].external_storage
```
This TODO should be resolved before the PR is merged.
Additional Comments (4)
* Avoid outstanding-keys overflow: decrement at the end of fwd
* Fix seq-emb's bw test; fix ref_counter increment/decrement bug
  1. Fix sequence embedding backward test.
     Issue: there are two forward calls and one backward call in one iteration, which increments the ref_counter twice but decrements it only once.
     Fix: switch to eval mode when only evaluating the model. (Alternative: move the ref_counter decrement to the end of fwd, but that may unlock a key too early when prefetch and backward overlap.)
  2. Fix ref_counter increment/decrement bug.
     Issue: the slot_indices argument starts at 0 for each table, but a flat index is needed; besides, the flat indices within one iteration are unique, since two keys can't share the same slot.
     Fix: flatten the slot_indices for each table using table_ids before incrementing/decrementing.
* Route to the correct ref_counter table in the insert kernel
* Update the score at the end of prefetch, and only update it for STEP
* Fix the expected score in tests, since the score is now updated during prefetch
* Remove the default value for table_ids in increment/decrement_counter and make it non-optional
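The flat-index fix described in the commit message can be sketched as follows — per-table slot indices are offset by each table's base so two tables can never alias the same ref-counter entry. Capacities and names here are assumed for illustration, not taken from the PR:

```python
# Per-table cache capacities (assumed for illustration).
table_capacities = [4, 4, 8]

# Base offset of each table's region in the flattened counter array.
table_bases = [0]
for cap in table_capacities[:-1]:
    table_bases.append(table_bases[-1] + cap)  # → [0, 4, 8]

ref_counter = [0] * sum(table_capacities)


def increment_counter(slot_indices, table_ids):
    # slot_indices start at 0 for every table; table_ids route each one
    # into the correct region of the flat counter, avoiding collisions.
    for slot, tid in zip(slot_indices, table_ids):
        ref_counter[table_bases[tid] + slot] += 1


# Slot 0 of table 0, slot 0 of table 1, slot 2 of table 2:
increment_counter([0, 0, 2], [0, 1, 2])
```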
Force-pushed from 2590bf7 to 8cc97ab (Compare)
I encountered a severe perf regression with hybrid storage mode (global_hbm_size = 0) — the new table_insert_and_evict_kernel is quite slow. (Timeline screenshots comparing before vs. after, and the profiling report, were attached.)
```python
def enable_prefetch(self, value: bool):
    self._enable_prefetch = value
    self.num_prefetch_ahead = 0
    self._prefetch_outstanding_keys.zero_()
```
It's not recommended to invoke self._prefetch_outstanding_keys.zero_() inside a setter. See this blog's recommendations:
- Use public attributes whenever appropriate, even if you expect the attribute to require functional behavior in the future.
- Avoid defining setter and getter methods for your attributes. You can always turn them into properties if needed.
- Use properties when you need to attach behavior to attributes and keep using them as regular attributes in your code.
- Avoid side effects in properties because no one would expect operations like assignments to cause any side effects.
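The blog's advice in code form — a hedged sketch, not the PR's implementation: keep the attribute plain and make the state reset an explicit method, so a simple assignment never clears counters behind the caller's back.

```python
class Tables:
    """Toy model of the setter-side-effect concern above."""

    def __init__(self):
        self.enable_prefetch = False  # plain attribute: assignment only
        self.num_prefetch_ahead = 3
        self._outstanding = 7         # stand-in for the outstanding-keys tensor

    def reset_prefetch(self, enable):
        # Explicit method: the caller can see that state is being cleared.
        self.enable_prefetch = enable
        self.num_prefetch_ahead = 0
        self._outstanding = 0


t = Tables()
t.enable_prefetch = True   # no hidden side effects
t.reset_prefetch(False)    # clearing counters is opt-in and visible
```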


Description

Checklist

- ci
- CI after fixed