Refactor Mooncake Store: force delete + hard pin #73

Draft
cicirori wants to merge 1 commit into torchspec-project:main from cicirori:refactor/mooncake-force-delete


@cicirori
Collaborator

Summary

  • Replace DeferredDeleteManager with batch_remove(force=True) so hidden states are cleaned up immediately after consumption. This fixes the production issue where a large kv_lease_ttl_s causes store memory exhaustion and eviction of unconsumed objects
  • Add hard_pin support via ReplicateConfig(with_hard_pin=True) (default off, pending Mooncake Python API release)
  • Add a runtime version check that fails fast if Mooncake < 0.3.10.post1
  • Pin mooncake-transfer-engine >= 0.3.10.post1 in pyproject.toml
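
The fail-fast version check described above could look roughly like the following. This is a minimal sketch, not the PR's `_verify_force_delete()`: the helper names `parse_version` and `verify_min_version` are hypothetical, and the parser only understands `X.Y.Z` and `X.Y.Z.postN` strings rather than full PEP 440.

```python
import re

MIN_MOONCAKE = "0.3.10.post1"  # minimum version named in this PR


def parse_version(text):
    """Parse 'X.Y.Z' or 'X.Y.Z.postN' into a comparable tuple.

    Simplified on purpose: anything else raises ValueError.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, post = match.groups()
    # A release without a post segment sorts before any .postN of it.
    return (int(major), int(minor), int(patch), int(post) if post is not None else -1)


def verify_min_version(installed, minimum=MIN_MOONCAKE):
    """Raise RuntimeError if `installed` is older than `minimum`."""
    if parse_version(installed) < parse_version(minimum):
        raise RuntimeError(
            f"Mooncake {installed} is too old; force delete needs >= {minimum}"
        )


verify_min_version("0.3.10.post1")  # passes silently
```

In a real environment the installed version string could presumably be obtained with `importlib.metadata.version("mooncake-transfer-engine")`; a production check would more likely use a full PEP 440 parser than this hand-rolled one.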

Test plan

  • 18 unit tests pass locally: pytest tests/test_mooncake_force_delete.py -v
  • Verify batch_remove(force=True) works against real Mooncake on GPU cluster
  • Run training pipeline end-to-end with refactored deletion path
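
For reference, a unit test of the forced-delete path that needs no real Mooncake store might be shaped like the sketch below. It is not taken from `tests/test_mooncake_force_delete.py`. The helper `remove_keys`, the positional `keys` argument, and the assumption that `batch_remove` reports failure by raising `RuntimeError` are all hypothetical; only the `force=True` keyword comes from this PR.

```python
from unittest.mock import Mock, call


def remove_keys(store, keys, max_attempts=3):
    """Hypothetical helper: force-delete `keys`, retrying when the call raises."""
    last_error = None
    for _ in range(max_attempts):
        try:
            store.batch_remove(keys, force=True)
            return
        except RuntimeError as exc:  # assumed failure mode
            last_error = exc
    raise RuntimeError(
        f"batch_remove failed after {max_attempts} attempts"
    ) from last_error


def test_force_delete_retries_then_succeeds():
    store = Mock()
    # First call raises, second call returns None (success).
    store.batch_remove.side_effect = [RuntimeError("transient"), None]
    remove_keys(store, ["k1", "k2"])
    assert store.batch_remove.call_args_list == [call(["k1", "k2"], force=True)] * 2


def test_force_delete_gives_up():
    store = Mock()
    store.batch_remove.side_effect = RuntimeError("down")
    try:
        remove_keys(store, ["k1"], max_attempts=2)
    except RuntimeError as exc:
        assert "2 attempts" in str(exc)
    else:
        raise AssertionError("expected RuntimeError")
    assert store.batch_remove.call_count == 2
```

Mocking the store keeps the test independent of GPU hardware, which is why the cluster verification is listed as a separate item.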

Closes #72

@cicirori force-pushed the refactor/mooncake-force-delete branch 9 times, most recently from 6868f09 to 359133a (April 13, 2026 19:02)

Replace DeferredDeleteManager with batch_remove(force=True) for immediate
cleanup after consumption. Add hard_pin support (default off, pending
Mooncake release).

- Use batch_remove(force=True) in remove_eagle3_tensors() with retry
- Add _verify_force_delete() to fail fast unless Mooncake >= 0.3.10.post1
- Add _build_replicate_config() for ReplicateConfig(with_hard_pin=True)
- Pass replicate_config through AsyncPutManager and sync put paths
- Delete deferred_delete.py entirely
- Add enable_hard_pin config with export_env/from_env/from_flat_args
- Remove unused enable_soft_pin config field
- Pin mooncake-transfer-engine >= 0.3.10.post1 in pyproject.toml
- Error cleanup in put paths uses batch_remove(force=True) with try/except
- Add 18 unit tests with conftest.py mock finder for torch-less envs
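
The `enable_hard_pin` flag and its translation into a replicate config could be sketched as follows. Everything here is illustrative: `StoreConfig`, `ReplicateConfigStub`, `build_replicate_config`, and the environment variable name are hypothetical stand-ins, and returning `None` when the flag is off is an assumption about how the default path might behave. Only `enable_hard_pin`, `export_env`, `from_env`, and `with_hard_pin=True` are names from this PR.

```python
import os
from dataclasses import dataclass

ENV_HARD_PIN = "MOONCAKE_ENABLE_HARD_PIN"  # hypothetical variable name


@dataclass
class ReplicateConfigStub:
    """Stand-in for Mooncake's ReplicateConfig; models only with_hard_pin."""
    with_hard_pin: bool = False


@dataclass
class StoreConfig:
    enable_hard_pin: bool = False  # default off, as stated in the PR

    def export_env(self, env=None):
        """Write the flag into `env` (defaults to os.environ)."""
        env = os.environ if env is None else env
        env[ENV_HARD_PIN] = "1" if self.enable_hard_pin else "0"

    @classmethod
    def from_env(cls, env=None):
        """Rebuild the config from `env`; a missing variable means off."""
        env = os.environ if env is None else env
        return cls(enable_hard_pin=env.get(ENV_HARD_PIN, "0") == "1")


def build_replicate_config(config):
    """Return a hard-pinned replicate config, or None when the flag is off."""
    if not config.enable_hard_pin:
        return None
    return ReplicateConfigStub(with_hard_pin=True)


# Round trip through a plain dict instead of the real process environment.
env = {}
StoreConfig(enable_hard_pin=True).export_env(env)
restored = StoreConfig.from_env(env)
```

Passing a dict as `env` keeps the round trip testable without mutating `os.environ`.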

Closes torchspec-project#72

Development

Successfully merging this pull request may close these issues.

Refactor Mooncake Store: use force delete and hard pin to replace TTL-based deferred deletion
