Refactor Mooncake Store: use force delete and hard pin to replace TTL-based deferred deletion #72

@cicirori

Background

The current hidden-state lifecycle management in Mooncake Store relies on the lease TTL mechanism: after consumption, DeferredDeleteManager must wait for the TTL to expire before deleting objects (Mooncake does not allow deleting objects with active leases). This couples deletion timing to the TTL: the larger the TTL, the longer consumed hidden states remain in the store and the higher the memory pressure.

Production Issue

With kv_lease_ttl_s: 300, consumed hidden states remain in the store for up to 5 minutes, causing:

  • Total memory = in-flight hidden states + consumed-but-waiting-for-TTL hidden states
  • Exceeds Mooncake Store capacity → triggers LRU eviction
  • Eviction may delete keys still in use → batch_get failures
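
A back-of-the-envelope estimate makes the coupling concrete. The sketch below assumes a steady consumption rate and a fixed size per sample, and it ignores the 0.5 s buffer; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def waiting_bytes(consume_rate_per_s: float, bytes_per_sample: int, ttl_s: float) -> float:
    """Steady-state bytes held by consumed samples that still wait for TTL expiry.

    Little's law: items in the waiting state = arrival rate * time spent waiting.
    """
    return consume_rate_per_s * ttl_s * bytes_per_sample


# Hypothetical workload: 4 samples/s, 64 MiB of hidden states per sample.
rate, size = 4.0, 64 * 1024**2
for ttl in (5.0, 300.0):
    gib = waiting_bytes(rate, size, ttl) / 1024**3
    print(f"ttl={ttl:.0f}s -> {gib:.2f} GiB waiting for expiry")
```

With these made-up numbers, raising the TTL from 5 s to 300 s grows the consumed-but-waiting set from 1.25 GiB to 75 GiB, while the in-flight set is unchanged.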

Current Implementation

  • torchspec/transfer/mooncake/deferred_delete.py: DeferredDeleteManager enqueues deletions after consumption; a background thread waits for the TTL plus a 0.5 s buffer, then calls store.remove(), retrying up to 3 times on failure
  • torchspec/transfer/mooncake/eagle_store.py: remove_eagle3_tensors() submits 4 tensor keys to the deferred delete manager
  • torchspec/config/mooncake_config.py: kv_lease_ttl_s defaults to 5.0s
  • torchspec/transfer/mooncake/utils.py:110: TTL converted to milliseconds and passed to mooncake master launch args
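
For readers unfamiliar with the flow, the TTL-gated behavior described above can be approximated as follows. This is a simplified sketch, not the contents of deferred_delete.py: the class name, the `remove` callable returning True on success, and the reading of "up to 3 times" as three attempts in total are all assumptions, and the background thread is replaced by an explicit `drain_due` call so the timing is easy to follow.

```python
import heapq
import time
from typing import Callable, List, Optional, Tuple


class DeferredDeleteSketch:
    """Illustrative stand-in for TTL-gated deletion; not the real DeferredDeleteManager."""

    def __init__(self, remove: Callable[[str], bool], ttl_s: float,
                 buffer_s: float = 0.5, max_attempts: int = 3) -> None:
        self._remove = remove            # assumed to return True on success
        self._delay = ttl_s + buffer_s   # wait out the lease plus a safety buffer
        self._max_attempts = max_attempts
        self._heap: List[Tuple[float, str]] = []  # (earliest delete time, key)

    def enqueue(self, key: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        heapq.heappush(self._heap, (now + self._delay, key))

    def drain_due(self, now: Optional[float] = None) -> List[str]:
        """Remove every key whose wait has elapsed; return keys that kept failing."""
        now = time.monotonic() if now is None else now
        failed: List[str] = []
        while self._heap and self._heap[0][0] <= now:
            _, key = heapq.heappop(self._heap)
            # any() stops at the first successful attempt.
            if not any(self._remove(key) for _ in range(self._max_attempts)):
                failed.append(key)
        return failed


# A dict stands in for the store; timestamps are passed explicitly for clarity.
store = {"hs/0": b"payload"}
mgr = DeferredDeleteSketch(lambda k: store.pop(k, None) is not None, ttl_s=5.0)
mgr.enqueue("hs/0", now=0.0)
mgr.drain_due(now=5.0)             # 5.0 < 5.5: the key is still in the store
still_there = "hs/0" in store
mgr.drain_due(now=5.5)             # wait elapsed: the key is removed
```

The key point is that the object occupies store memory for the whole `ttl_s + buffer_s` window even though nothing will read it again.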

Proposed Refactoring

Newer versions of Mooncake support force delete and hard pin. We should leverage these APIs to refactor the deletion logic:

1. Force Delete: immediate deletion after consumption

  • Call force delete immediately after consumption instead of waiting for TTL expiration
  • DeferredDeleteManager can be significantly simplified or removed; only the retry logic needs to remain
  • Completely decouples TTL from deletion timing
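
A minimal sketch of what the simplified deletion path might look like, assuming a store object that exposes a force-delete call. The method name `force_remove`, its boolean return value, and the helper `delete_now` are hypothetical; the actual Mooncake API may differ in name, signature, and return convention.

```python
from typing import Dict, Iterable, List, Protocol


class ForceDeleteStore(Protocol):
    # Hypothetical interface: the real Mooncake force-delete call may have a
    # different name, signature, and return convention.
    def force_remove(self, key: str) -> bool: ...


def delete_now(store: ForceDeleteStore, keys: Iterable[str],
               max_attempts: int = 3) -> List[str]:
    """Force-delete each key right away; return the keys that still failed."""
    failed: List[str] = []
    for key in keys:
        for _ in range(max_attempts):
            if store.force_remove(key):
                break
        else:  # no attempt succeeded
            failed.append(key)
    return failed


class FlakyStore:
    """Toy store whose first removal attempt per key fails, to exercise the retry."""

    def __init__(self, data: Dict[str, bytes]) -> None:
        self.data = data
        self._seen: set = set()

    def force_remove(self, key: str) -> bool:
        if key not in self._seen:
            self._seen.add(key)
            return False  # simulate a transient failure
        return self.data.pop(key, None) is not None


store = FlakyStore({"a": b"1", "b": b"2"})
failed = delete_now(store, ["a", "b", "missing"])
```

No timer, queue, or background thread is needed in this shape; the only state left is the list of keys that could not be deleted, which the caller can log or retry later.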

2. Hard Pin: protect in-flight hidden states

  • Apply hard pin to hidden states that are being transferred or awaiting consumption, preventing eviction from deleting them
  • Unpin + force delete after consumption completes
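
The pin, consume, unpin, delete sequence can be illustrated with a toy in-memory store. Everything here is an assumption made for the sake of the example: `ToyStore`, its LRU policy, and the method names `pin`, `unpin`, and `force_remove` are hypothetical and are not taken from the Mooncake API.

```python
from collections import OrderedDict
from contextlib import contextmanager


class ToyStore:
    """In-memory stand-in with LRU eviction; pin/unpin/force_remove are hypothetical names."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.data: "OrderedDict[str, bytes]" = OrderedDict()
        self.pinned: set = set()

    def put(self, key: str, value: bytes) -> None:
        self.data[key] = value
        self.data.move_to_end(key)
        # Evict least-recently-used unpinned keys while over capacity.
        for victim in [k for k in self.data if k not in self.pinned]:
            if len(self.data) <= self.capacity:
                break
            del self.data[victim]

    def pin(self, key: str) -> None:
        self.pinned.add(key)

    def unpin(self, key: str) -> None:
        self.pinned.discard(key)

    def force_remove(self, key: str) -> bool:
        return self.data.pop(key, None) is not None


@contextmanager
def in_flight(store: ToyStore, key: str, value: bytes):
    """Keep `key` pinned for the duration of the block, then unpin and force-delete."""
    store.put(key, value)
    store.pin(key)
    try:
        yield key
    finally:
        store.unpin(key)
        store.force_remove(key)


store = ToyStore(capacity=2)
with in_flight(store, "hs/0", b"a"):
    store.put("other/1", b"b")
    store.put("other/2", b"c")      # over capacity: evicts other/1, not the pinned key
    survived = "hs/0" in store.data
```

In the real pipeline the pin would likely be applied on the producer side and the unpin plus force delete on the consumer side, so a single context manager is only a compact way to show the ordering. A production version would also want the put and pin to be atomic, so that eviction cannot run between them.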

3. Cleanup

  • Remove or simplify TTL-waiting logic in DeferredDeleteManager
  • kv_lease_ttl_s should no longer affect application-level data lifecycle management

Expected Outcome

  • Store memory freed immediately after consumption — no TTL delay
  • In-flight hidden states protected by hard pin, immune to eviction
  • Store memory usage depends only on pipeline concurrency, not TTL
  • Improved robustness: no longer relies on the implicit assumption that "store capacity > total in-flight hidden states"

Related Files

  • torchspec/transfer/mooncake/deferred_delete.py
  • torchspec/transfer/mooncake/eagle_store.py
  • torchspec/transfer/mooncake/utils.py
  • torchspec/config/mooncake_config.py
  • torchspec/training/data_fetcher.py
