Upstream 3pick#336

Draft
fshhr46 wants to merge 19 commits into NVIDIA:main from fshhr46:upstream-3pick


@fshhr46 fshhr46 commented Mar 27, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

jiashuy and others added 19 commits March 26, 2026 06:07
* Refactor dynamicemb with Cache&Storage.

* Add score support and sync of event and stream in prefetch

* Cache&Storage C++ codes

* Restore optimizer.py

* Format dynamicemb code.

* Pass compile.

* Support eval mode in Cache&Storage

* Cache metrics.

* Test forward&backward w/wt cache/eval in BatchedDynamicEmbeddingTablesV2

* update HKV

* Test prefetch and flush.

* Test external PS.

* Benchmark Cache&Storage

* Update benchmark results on EOS

* Fix unit test script

* Add load API for Storage

* Fix memory consumption calculation

* Fix memory consumption and copyright
* Admission counter table interface.

* Counter table implementation

* Unit test and Fix IMA of table operations

* Unit test of table.dump and table.load

* Add table operation unit tests to CI

* Unit test of table.dump&load when num_gpu mismatched.

* Unit test to table.insert_and_evict.

* Add todo to unlock using index.

* Remove kvcounter in dynamicemb_table_v2

* Refine unit test of load and dump APIs of ScoredHashTable

* Fix potential issues and rigorously test the score in test_embedding_dump_load.py
* Add gradient clipping in dynamicemb

* Fix potential capacity mismatch issue in incremental_dump
* Draft usage of KVCounter.

* Add FrequencyAdmissionStrategy and AdmissionStrategy class.

* Add storage only admission in training(Need to test).

* Add storage only admission in training Step 2.

* Add cache and storage admission in training(Need to test).

* Pass Admission Counter to KeyValueTable lookup.

* Add has-admission flag for dedup part in input dist.

* Add test for embedding admission and fix bug in lookup.

* Fix cache frequency bug.

* Fix some bugs.

* Rebase Counter table and fix some comment's issues.

* Move admit strategy class to embedding_admission.py.

* Rebase Counter table and move counter init outside tableoptions.

* Fix some bugs.

* Dump and load correct counter files in dynamic_table_v2

* Unit test of counter table's checkpoint.

* Decoupling lookup and admission.

* Decoupling training and insert.

* Do admission before initializer.

* Add DynamicEmbInitializerArgs for admit strategy.

* Add Initializer for non-admit embs.

* Fix circular dependency about initializer class.

* Update document about embedding admission

* Add admission options to example.py.

* Add comment for admission threshold.

* Fix segmented unique rebase bugs.

* Fix segmented unique rebase bugs step2.

* Fix duplicated check to counter keys in test_embedding_dump_load.py

* Fix test and format codes

* Move create_initializer to initializer.py to unify the creation logic.

* Add score_strategy and admit strategy into get_grouped_key of DynamicEmbTableOptions.

* Fix admission test assertion for multiple GPUs.

* Integrated initialize_non_admitted_embeddings.

* Pass admit strategy and evict strategy from table to function.

* Fix bugs.

* Fix bugs.

* Remove some comments.

---------

Co-authored-by: Jiashu Yao <jiashu.yao.cn@gmail.com>
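
For readers unfamiliar with frequency-based admission, the commit list above (counter table, admission threshold, admission before the initializer) can be illustrated with a small sketch. This is not the dynamicemb implementation: the class `ToyFrequencyAdmission`, its `observe` method, and the use of an in-memory `Counter` are all hypothetical stand-ins, and the rule "admit a key once it has been seen at least `threshold` times" is an assumption about what the threshold means.

```python
from collections import Counter


class ToyFrequencyAdmission:
    """Hypothetical stand-in, not the FrequencyAdmissionStrategy from the PR."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed meaning: minimum sightings to admit
        self.counts = Counter()     # plays the role of the admission counter table

    def observe(self, keys):
        """Count every occurrence, then split the unique keys of this batch."""
        self.counts.update(keys)
        unique = list(dict.fromkeys(keys))  # de-duplicate, keep first-seen order
        admitted = [k for k in unique if self.counts[k] >= self.threshold]
        rejected = [k for k in unique if self.counts[k] < self.threshold]
        return admitted, rejected


strategy = ToyFrequencyAdmission(threshold=2)
first = strategy.observe([1, 2, 1])   # key 1 seen twice, key 2 once
second = strategy.observe([2])        # key 2 reaches the threshold
```

In this sketch, rejected keys would be the ones that receive a non-admitted initializer value instead of a row in the table.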
Required by f5b608e C++ sources which use find_and_update and other
APIs added in 9c197a9c558d1e8285c2e50c1974f0f102826f11.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace all src/ .cu/.h/.cuh files with their f5b608e versions to
ensure consistency with the cherry-picked Python code. Key additions:
find_pointers_with_scores, insert_and_evict_with_scores,
find_and_initialize bindings in dynamic_emb_op.cu; updated
hkv_variable.cuh/h with new virtual method overrides; all 18
hkv_variable_instantiations regenerated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
f5b608e sources use std::optional in dynamic_variable_base.h and
hkv_variable.h. Without C++17, nvcc and g++ both fail to resolve
std::optional, causing cascading override errors.
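
A minimal sketch of forcing C++17 for both compilers, assuming the extension is built with per-compiler flag lists keyed by `cxx` and `nvcc` (the shape that `torch.utils.cpp_extension.CUDAExtension` accepts for `extra_compile_args`). The helper name `cpp17_compile_args` is hypothetical and is not taken from the PR.

```python
def cpp17_compile_args(base=None):
    """Return per-compiler flag lists with any -std= flag replaced by -std=c++17."""
    args = {"cxx": [], "nvcc": []}
    for compiler, flags in (base or {}).items():
        args[compiler] = list(flags)
    for compiler in ("cxx", "nvcc"):
        # Drop an older standard flag so the two compilers cannot disagree.
        kept = [f for f in args[compiler] if not f.startswith("-std=")]
        args[compiler] = kept + ["-std=c++17"]
    return args


flags = cpp17_compile_args({"cxx": ["-O3", "-std=c++14"]})
```

The resulting dictionary could then be passed as `extra_compile_args` when declaring the extension.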

Also track build.sh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace all dynamicemb/*.py and shard/planner files with f5b608e
versions to match the C++ extension bindings. Key changes:
- batched_dynamicemb_function.py: drop lookup_*_dense imports (not
  in our extension); use lookup_forward/backward + find_and_initialize
- shard/embedding.py: add DynamicEmbeddingCollectionContext class
- dump_load.py: re-export DynamicEmbInitializerArgs/Mode,
  DynamicEmbScoreStrategy, DynamicEmbTableOptions for backward
  compatibility with cherry-picked tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Machine has 2 GPUs; all test scripts hardcoded 4. Changed NUM_GPUS
and --nproc_per_node to 2 in all affected scripts. Restored
test_lfu_scores.sh from f5b608e (was missing from cherry-pick).
Replaced test_embedding_dump_load.py with f5b608e version to fix
missing imports (click, typing, record decorator).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Document all build errors encountered (std::optional, submodule,
override errors), Python alignment issues, and test fixes applied
during the build-install-test loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All test shell scripts now read DYNAMICEMB_NUM_GPUS (default 2) to set
--nproc_per_node, replacing hardcoded values of 2, 4, or 8.  Scripts
that run multiple torchrun calls concurrently with test_embedding_dump_load
also read DYNAMICEMB_MASTER_PORT (per-script defaults 29601–29604) so they
can run in parallel without competing on port 29500.
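
The same lookup can be sketched in Python, assuming the variable names and defaults given above. The helper `torchrun_command` is hypothetical; the actual change is in the shell scripts.

```python
import os


def torchrun_command(script, default_port, env=None):
    """Build a torchrun argv from DYNAMICEMB_NUM_GPUS / DYNAMICEMB_MASTER_PORT."""
    env = os.environ if env is None else env
    nproc = int(env.get("DYNAMICEMB_NUM_GPUS", "2"))                  # default 2
    port = int(env.get("DYNAMICEMB_MASTER_PORT", str(default_port)))  # per-script default
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={port}",
        script,
    ]


cmd = torchrun_command("test_embedding_dump_load.py", 29601, env={})
```

Giving each script its own default port is what lets several of them run side by side without any environment configuration.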

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
f5b608e renamed BatchedDynamicEmbeddingTables to
BatchedDynamicEmbeddingTablesV2. Add an alias at module level so
existing code importing the old name continues to work.
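
A module-level alias of this kind is a one-line binding. The sketch below uses an empty stand-in class, since the real `BatchedDynamicEmbeddingTablesV2` constructor is not shown here; the `num_tables` argument is hypothetical.

```python
class BatchedDynamicEmbeddingTablesV2:
    """Stand-in for the renamed class (constructor arguments are hypothetical)."""

    def __init__(self, num_tables):
        self.num_tables = num_tables


# Backward-compatible alias: the old name refers to the very same class object.
BatchedDynamicEmbeddingTables = BatchedDynamicEmbeddingTablesV2

legacy = BatchedDynamicEmbeddingTables(num_tables=3)
```

Because both names point at one class, `isinstance` checks and pickled references keyed on either name resolve to the same type.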

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test used module.tables[0] (returns KeyValueTable, not DynamicEmbTable)
and module.optimizer (returns BaseDynamicEmbeddingOptimizerV2 with
incompatible update() signature). Fix by constructing hashtables via
initialize_hashtables() directly and instantiating old-style optimizers
via dynamicemb_optimizer_class with explicit table options.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Document final test results and the additional fixes applied during
the test loop (GPU count, master ports, missing test files, API fixes).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
KeyValueTable wrappers so goatee callers (which pass KeyValueTable
objects) can use free functions like dyn_emb_cols/rows/capacity,
insert_or_assign and export_batch that previously only accepted the
raw C++ DynamicEmbTable.

- dynamicemb_config.py:
  - dyn_emb_to_torch: passthrough when already a torch.dtype
  - _unwrap_table(): extracts DynamicEmbTable from KeyValueTable
  - Python wrappers: dyn_emb_cols, dyn_emb_rows, dyn_emb_capacity,
    insert_or_assign, export_batch — all accept either table type

- key_value_table.py:
  - optstate_dim() → backward-compat alias for optim_state_dim()
  - get_initial_optstate() → forwarded to underlying DynamicEmbTable

- optimizer.py:
  - BaseDynamicEmbeddingOptimizer.register(BaseDynamicEmbeddingOptimizerV2)
    so isinstance(v2_optimizer, BaseDynamicEmbeddingOptimizer) is True

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4 participants