Background
The current hidden states lifecycle management in Mooncake Store relies on the lease TTL mechanism: after consumption, DeferredDeleteManager must wait for the TTL to expire before deleting objects (Mooncake does not allow deleting objects with active leases). This creates a fundamental coupling problem — the larger the TTL, the longer consumed hidden states remain in the store, and the higher the memory pressure.
Production Issue
With kv_lease_ttl_s: 300, consumed hidden states remain in the store for up to 5 minutes, causing:
- Total memory = in-flight hidden states + consumed-but-waiting-for-TTL hidden states
- Exceeds Mooncake Store capacity → triggers LRU eviction
- Eviction may delete keys still in use → batch_get failures
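A rough back-of-the-envelope estimate shows how the TTL term dominates. The sketch below applies Little's law (resident objects = arrival rate × residency time). The workload numbers (4 samples/s, 64 MiB per sample, 2 s in flight) are made up for illustration and are not measurements from production.

```python
MIB = 2**20


def store_footprint_bytes(rate_per_s, bytes_per_sample, in_flight_s, ttl_s, buffer_s=0.5):
    """Estimate resident bytes as arrival rate * residency time * object size.

    Returns (in_flight_bytes, consumed_but_waiting_bytes). The waiting time is
    ttl_s + buffer_s, matching the deferred-delete wait described in this issue.
    """
    in_flight = rate_per_s * in_flight_s * bytes_per_sample
    waiting = rate_per_s * (ttl_s + buffer_s) * bytes_per_sample
    return in_flight, waiting


# Hypothetical workload: 4 samples/s, 64 MiB of hidden states each, 2 s in flight.
for ttl_s in (5.0, 300.0):
    in_flight, waiting = store_footprint_bytes(4, 64 * MIB, 2.0, ttl_s)
    print(f"ttl={ttl_s}s in-flight={in_flight / MIB:.0f} MiB waiting={waiting / MIB:.0f} MiB")
```

Under these assumed numbers, the consumed-but-waiting term at a 300 s TTL is roughly 150 times the in-flight term, which is the memory that immediate deletion would reclaim.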
Current Implementation
- torchspec/transfer/mooncake/deferred_delete.py: DeferredDeleteManager enqueues deletions after consumption; a background thread waits ttl + 0.5s buffer, then calls store.remove(), retrying up to 3 times on failure
- torchspec/transfer/mooncake/eagle_store.py: remove_eagle3_tensors() submits 4 tensor keys to the deferred delete manager
- torchspec/config/mooncake_config.py: kv_lease_ttl_s defaults to 5.0s
- torchspec/transfer/mooncake/utils.py:110: TTL converted to milliseconds and passed to mooncake master launch args
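For readers unfamiliar with the current flow, the behavior described above can be approximated as follows. This is a simplified sketch, not the code in deferred_delete.py: the class names DeferredDeleteSketch and InMemoryStore are hypothetical, the convention that remove() returns 0 on success is an assumption, and "up to 3 retries" is interpreted here as one initial attempt plus three retries.

```python
import queue
import threading
import time


class InMemoryStore:
    """Stand-in for the store client; remove() returning 0 on success is an assumption."""

    def __init__(self, fail_first=0):
        self.data = {}
        self._fail_remaining = fail_first  # simulate transient remove() failures

    def remove(self, key):
        if self._fail_remaining > 0:
            self._fail_remaining -= 1
            return -1
        self.data.pop(key, None)
        return 0


class DeferredDeleteSketch:
    """Delete keys only after ttl_s + buffer_s has elapsed since submission."""

    def __init__(self, store, ttl_s, buffer_s=0.5, max_retries=3):
        self._store = store
        self._wait_s = ttl_s + buffer_s
        self._max_retries = max_retries
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, keys):
        # Record the earliest time at which the lease is assumed to have expired.
        self._queue.put((time.monotonic() + self._wait_s, list(keys)))

    def close(self):
        self._queue.put(None)  # sentinel: finish pending work, then stop
        self._thread.join()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:
                return
            deadline, keys = item
            # The objects stay in the store for this whole wait.
            time.sleep(max(0.0, deadline - time.monotonic()))
            for key in keys:
                self._remove_with_retry(key)

    def _remove_with_retry(self, key):
        for _ in range(1 + self._max_retries):
            if self._store.remove(key) == 0:
                return True
            time.sleep(0.01)  # short pause before retrying
        return False
```

The sleep in _run is the coupling this issue is about: every consumed object occupies store memory for the full TTL plus buffer before remove() is even attempted.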
Proposed Refactoring
Newer versions of Mooncake support force delete and hard pin. We should leverage these APIs to refactor the deletion logic:
1. Force Delete: immediate deletion after consumption
- Call force delete immediately after consumption instead of waiting for TTL expiration
- DeferredDeleteManager can be significantly simplified or removed; only the retry logic needs to remain
- Completely decouples TTL from deletion timing
2. Hard Pin: protect in-flight hidden states
- Apply hard pin to hidden states that are being transferred or awaiting consumption, preventing eviction from deleting them
- Unpin + force delete after consumption completes
3. Cleanup
- Remove or simplify TTL-waiting logic in DeferredDeleteManager
- kv_lease_ttl_s should no longer affect application-level data lifecycle management
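The intended lifecycle (pin on write, then unpin and force delete after consumption) could look roughly like the sketch below. It runs against an in-memory fake so the control flow can be checked in isolation. The method names put(..., hard_pin=...), unpin() and force_remove() are hypothetical placeholders, not the Mooncake API; the real calls and their signatures need to be taken from the Mooncake version being targeted.

```python
from collections import OrderedDict


class FakePinnedStore:
    """In-memory stand-in with LRU eviction that skips pinned keys. NOT the Mooncake API."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()  # insertion order doubles as LRU order
        self._pinned = set()

    def __contains__(self, key):
        return key in self._data

    def __len__(self):
        return len(self._data)

    def put(self, key, value, hard_pin=False):
        while len(self._data) >= self.capacity:
            # Evict the oldest unpinned object; pinned objects are never victims.
            victim = next((k for k in self._data if k not in self._pinned), None)
            if victim is None:
                raise MemoryError("store is full of pinned objects")
            del self._data[victim]
        self._data[key] = value
        if hard_pin:
            self._pinned.add(key)

    def get(self, key):
        return self._data[key]

    def unpin(self, key):
        self._pinned.discard(key)

    def force_remove(self, key):
        # Hypothetical: deletes regardless of lease state; idempotent for safe retries.
        self._pinned.discard(key)
        self._data.pop(key, None)
        return True


def consume_and_release(store, keys, max_attempts=3):
    """Read the tensors, then unpin and force-delete them with no TTL wait."""
    tensors = [store.get(k) for k in keys]
    for key in keys:
        store.unpin(key)
        for _ in range(max_attempts):
            if store.force_remove(key):
                break
        else:
            raise RuntimeError(f"could not delete {key!r} after {max_attempts} attempts")
    return tensors


store = FakePinnedStore(capacity=5)
store.put("old", b"stale")  # unpinned, so it is evictable
sample_keys = [f"sample0/t{i}" for i in range(4)]  # hypothetical key names
for k in sample_keys:
    store.put(k, b"hidden", hard_pin=True)
store.put("sample1/t0", b"hidden", hard_pin=True)  # evicts "old", not pinned keys
tensors = consume_and_release(store, sample_keys)
```

One design point this makes visible: once every resident object is pinned, a full store can no longer make room by eviction, so the writer needs some form of backpressure (the fake raises MemoryError). How the real store reports that condition is not covered here.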
Expected Outcome
- Store memory freed immediately after consumption — no TTL delay
- In-flight hidden states protected by hard pin, immune to eviction
- Store memory usage depends only on pipeline concurrency, not TTL
- Improved robustness: no longer relies on the implicit assumption that "store capacity > total in-flight hidden states"
Related Files
- torchspec/transfer/mooncake/deferred_delete.py
- torchspec/transfer/mooncake/eagle_store.py
- torchspec/transfer/mooncake/utils.py
- torchspec/config/mooncake_config.py
- torchspec/training/data_fetcher.py