
Per-slot memory tracking via per-key cache + signalModifiedKey #11

Draft
liorsve wants to merge 5 commits into unstable from per-slot-memory-per-key-cache


@liorsve liorsve commented Apr 13, 2026


Note on commit history

The first two commits are squashed versions of prior tracking PRs that this work builds on:


Summary

Adds per-slot memory tracking to CLUSTER SLOT-STATS using a per-key size cache with signalModifiedKey as the primary hook for writes, a rehash overhead check in call() for reads, plus dedicated hooks for flush, RDB load, DB lifecycle, and defrag.

One new metric is reported:

  • memory-logical-bytes: combined user data + container overhead (field-value pairs, set members, listpack bytes, stream entries, hashtable bucket arrays, rax node overhead, vset containers, quicklist node structs)

The metric supports ORDERBY for sorting slots by memory usage.

Comparison with PR #10 (before/after in call)

| | PR #10 (before/after in call) | This PR (per-key cache) |
|---|---|---|
| Tracking hook points | 8 (call writes, call reads, eviction, active expiry, hash field expiry, flush, RDB, AOF recount) | 4 (signalModifiedKey, call reads, flush, RDB) -- eviction, expiry, field expiry, and AOF load are covered automatically via signalModifiedKey |
| Extra lookups per write command | 2N dbFind where N = number of keys in the command (before + after snapshot) | 1 dbFind + 1 cache lookup per key changed via signalModifiedKey |
| Per-key memory overhead | 0 | ~24 bytes per key (sds copy + size_t + struct) |
| Client struct overhead | slot_mem_before field (needed to persist before-snapshot across cmd->proc) | None (per-key cache serves as the baseline) |
| Covers eviction/expiry | Needs explicit hooks | Automatic via signalModifiedKey |
| Lazy expiry double-count risk | Needs executing_command guard | No risk -- single hook point |
| DB lifecycle management | None | Cache must be cleared/swapped with DB; defrag support required |
| Rehash overhead on reads | Before/after snapshot on argv[1] | Compares objectLogicalSize against per-key cache |

Approach: per-key cache + signalModifiedKey

A single flat hashtable (db->key_mem_cache) stores the last-known logical_bytes per key. signalModifiedKey is the primary hook for all write mutations:

  1. Early return if db->key_mem_cache is NULL (tracking disabled)
  2. Look up current value via dbFind -- compute objectLogicalSize
  3. Look up cached size from key_mem_cache
  4. Delta = current - cached
  5. Update slot_stats[slot].memory_logical_bytes and refresh the cache entry
  6. For new keys: no cache entry exists, cached = 0, delta = +current. A new cache entry is inserted.
  7. For deleted keys: dbFind returns NULL, objectLogicalSize returns 0, delta = -cached. The cache entry is removed.

The cache uses a single hashtable (not a kvstore) since we only do point lookups by key name -- per-slot partitioning adds no value here.
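
The delta logic in the steps above can be sketched in isolation. This is a minimal illustration, not the code in this PR: the fixed-size array stands in for db->key_mem_cache, the caller passes the current size instead of the function calling dbFind and objectLogicalSize, and every `demo*` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins: a small fixed array replaces db->key_mem_cache,
 * and demo_slot_bytes replaces slot_stats[slot].memory_logical_bytes. */
#define DEMO_MAX_KEYS 16
#define DEMO_SLOTS 4

typedef struct {
    char key[32];
    size_t cached_bytes;
    int used;
} demoCacheEntry;

demoCacheEntry demo_cache[DEMO_MAX_KEYS];
int64_t demo_slot_bytes[DEMO_SLOTS];

/* Point lookup by key name; returns NULL when no entry exists. */
demoCacheEntry *demoCacheFind(const char *key) {
    for (int i = 0; i < DEMO_MAX_KEYS; i++)
        if (demo_cache[i].used && strcmp(demo_cache[i].key, key) == 0) return &demo_cache[i];
    return NULL;
}

/* Claims the first free entry; returns NULL when the demo cache is full. */
demoCacheEntry *demoCacheInsert(const char *key) {
    for (int i = 0; i < DEMO_MAX_KEYS; i++) {
        if (demo_cache[i].used) continue;
        demo_cache[i].used = 1;
        strncpy(demo_cache[i].key, key, sizeof(demo_cache[i].key) - 1);
        demo_cache[i].key[sizeof(demo_cache[i].key) - 1] = '\0';
        demo_cache[i].cached_bytes = 0;
        return &demo_cache[i];
    }
    return NULL;
}

/* exists == 0 models dbFind returning NULL (the key was deleted).
 * Returns the delta that was applied to the slot counter. */
int64_t demoSignalModifiedKey(int slot, const char *key, int exists, size_t current_bytes) {
    assert(slot >= 0 && slot < DEMO_SLOTS);
    size_t current = exists ? current_bytes : 0;
    demoCacheEntry *e = demoCacheFind(key);
    size_t cached = e ? e->cached_bytes : 0; /* new key: cached = 0 */
    int64_t delta = (int64_t)current - (int64_t)cached;
    demo_slot_bytes[slot] += delta;
    if (!exists) {
        if (e) e->used = 0; /* deleted key: drop the cache entry */
    } else {
        if (!e) e = demoCacheInsert(key);
        assert(e != NULL);
        e->cached_bytes = current; /* refresh the baseline */
    }
    return delta;
}
```

Because the cached value doubles as the "before" snapshot, a create, an in-place growth, and a delete all reduce to the same subtraction; only the cache maintenance differs.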

New function: objectLogicalSize()

O(1) function in object.c that reads incrementally-maintained tracked fields to compute logical size per type/encoding. Returns a single size_t combining data and overhead:

| Type | Encoding | What's included |
|---|---|---|
| STRING | RAW/EMBSTR | sdsReqSize (header + content + null) |
| STRING | INT | 0 (value embedded in robj pointer) |
| LIST | QUICKLIST | tracked_data_bytes + sizeof(quicklist) + len * sizeof(quicklistNode) |
| LIST | LISTPACK | lpBytes |
| SET | HASHTABLE | hashtableTrackedDataBytes + hashtableMemUsage |
| SET | INTSET | intsetBlobLen |
| SET | LISTPACK | lpBytes |
| HASH | HASHTABLE | hashtableTrackedDataBytes + hashtableMemUsage + vsetLogicalSize |
| HASH | LISTPACK | lpBytes |
| ZSET | LISTPACK | lpBytes |
| ZSET | SKIPLIST | 0 (no O(1) tracking yet) |
| STREAM | STREAM | tracked_data_bytes + tracked_overhead + sizeof(stream) |
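
The dispatch shape described by the table can be sketched for a few rows. This is an illustration under stated assumptions, not the function in object.c: `demoObject` carries already-tracked numbers as plain fields, and `DEMO_QUICKLIST_HDR` / `DEMO_QUICKLIST_NODE` are placeholder constants standing in for sizeof(quicklist) and sizeof(quicklistNode), not their real values.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder sizes (assumptions), standing in for sizeof(quicklist)
 * and sizeof(quicklistNode). */
#define DEMO_QUICKLIST_HDR 40
#define DEMO_QUICKLIST_NODE 32

typedef enum { DEMO_STRING, DEMO_LIST, DEMO_ZSET } demoType;
typedef enum {
    DEMO_ENC_RAW,
    DEMO_ENC_INT,
    DEMO_ENC_QUICKLIST,
    DEMO_ENC_LISTPACK,
    DEMO_ENC_SKIPLIST
} demoEncoding;

/* Hypothetical object: every field is a number that is maintained
 * incrementally elsewhere, so reading it is O(1). */
typedef struct {
    demoType type;
    demoEncoding encoding;
    size_t sds_req_size;       /* stands in for sdsReqSize */
    size_t lp_bytes;           /* stands in for lpBytes */
    size_t tracked_data_bytes; /* quicklist tracked_data_bytes */
    size_t node_count;         /* quicklist len */
} demoObject;

/* No loops and no traversal: each branch only adds up tracked fields. */
size_t demoObjectLogicalSize(const demoObject *o) {
    assert(o != NULL);
    switch (o->type) {
    case DEMO_STRING:
        /* INT-encoded strings live in the pointer, so they count as 0. */
        return o->encoding == DEMO_ENC_INT ? 0 : o->sds_req_size;
    case DEMO_LIST:
        if (o->encoding == DEMO_ENC_LISTPACK) return o->lp_bytes;
        return o->tracked_data_bytes + DEMO_QUICKLIST_HDR + o->node_count * DEMO_QUICKLIST_NODE;
    case DEMO_ZSET:
        /* SKIPLIST has no O(1) tracking yet, so it reports 0. */
        return o->encoding == DEMO_ENC_LISTPACK ? o->lp_bytes : 0;
    }
    return 0;
}
```

Keeping the function free of traversal is what makes it cheap enough to call from signalModifiedKey on every write.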

Key modification paths and how each is tracked

All key-level modifications go through signalModifiedKey, which is called by every write command, eviction, expiry, and hash field expiry path. Whole-database operations (flush, RDB load, DB swap and discard) have dedicated hooks. The one key-level exception is incremental rehashing during reads, which gets a dedicated check (see details below the table):

| Path | Where handled | How |
|---|---|---|
| All commands (SET, HSET, SADD, DEL, MULTI/EXEC, etc.) | signalModifiedKey() in db.c | Diff current vs cached, update slot stats |
| In-place mutations (HSET adding field, SREM, LPOP, XTRIM) | signalModifiedKey() in db.c | Diff current vs cached, update slot stats |
| Key eviction (maxmemory pressure) | signalModifiedKey() in db.c | Automatic -- eviction calls signalModifiedKey |
| Key expiry (active + lazy) | signalModifiedKey() in db.c | Automatic -- expiry calls signalModifiedKey |
| Hash field expiry (partial + empty key) | signalModifiedKey() in db.c | Automatic -- field expiry calls signalModifiedKey |
| FLUSHALL / FLUSHDB (async) | signalFlushedDb() in db.c + emptyDbAsync() in lazyfree.c | Zero slot stats; cache hashtable released and recreated |
| FLUSHALL / FLUSHDB (sync) | signalFlushedDb() in db.c + emptyDbStructure() in db.c | Zero slot stats; cache hashtable emptied |
| RDB loading | dbAddRDBLoad() in db.c | Add to slot stats + insert cache entry |
| AOF RESP loading | signalModifiedKey() via cmd->proc() | Automatic -- commands during AOF replay call signalModifiedKey |
| SWAPDB | dbSwapDatabases() in db.c | key_mem_cache swapped along with keys/expires |
| RDB reload (swapMainDbWithTempDb) | swapMainDbWithTempDb() in db.c | key_mem_cache swapped along with keys/expires |
| Temp DB discard | discardTempDb() in db.c | key_mem_cache released before freeing DB |
| Slot ownership changes | clusterSlotStatReset() | memset zeros entire slotStat |
| Incremental rehash during hash/set reads | call() in server.c | check on argv[1] via per-key cache (see below) |

Incremental rehash overhead tracking

The one exception to the "all changes go through signalModifiedKey" rule: hashtable bucket overhead can change during read commands.

Hash and set values using hashtable encoding undergo incremental rehashing. Every findBucket() call -- including those from read commands like HGET and SISMEMBER, or failed writes like HDEL on a non-existent field -- migrates one bucket from the old table to the new table. This changes hashtableMemUsage() (part of objectLogicalSize) without any data modification and without calling signalModifiedKey.

Each rehash step can change overhead by sizeof(bucket) (64 bytes) as child buckets are freed. When rehashing completes, the old table is freed entirely -- overhead can drop by hundreds of bytes in a single step.

Solution: clusterSlotStatsHandleRehashOverhead(c) is called from call() after cmd->proc(c):

  1. Gate in call(): Only fires when dirty didn't change (signalModifiedKey already handled it otherwise), the command belongs to COMMAND_GROUP_HASH or COMMAND_GROUP_SET, and clusterSlotStatsEnabled.
  2. Inside the function: Looks up argv[1] in the kvstore. Bails if the value isn't hashtable-encoded. Reads objectLogicalSize, compares against the per-key cache entry, and applies the delta if they differ.

Only argv[1] is checked because all hash/set read commands are single-key. The only multi-key command that touches a second hash/set value is SMOVE, which is a write command -- already covered by signalModifiedKey.

Cost: For non-hash/set commands: zero (group check). For hash/set reads where nothing changed: one kvstoreHashtableFind + encoding check + objectLogicalSize + cache comparison -- all return early. The actual delta application only runs when the size genuinely changed (rare, only during active rehashing).

Self-correction without this check: the drift would still correct itself on the next successful write command to that key, which refreshes the per-key cache via signalModifiedKey.
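
The two-stage check (gate in call(), then comparison inside the handler) can be sketched as below. This is a simplified model rather than the PR's code: `demoKeyState` bundles what the real handler would obtain from the kvstore lookup, objectLogicalSize, and the per-key cache, and all `demo*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { DEMO_GROUP_STRING, DEMO_GROUP_HASH, DEMO_GROUP_SET } demoCommandGroup;

/* Hypothetical view of the key named by argv[1]. */
typedef struct {
    int hashtable_encoded; /* non-zero if the value uses hashtable encoding */
    size_t logical_size;   /* what objectLogicalSize would report now */
    size_t cached_size;    /* size stored in the per-key cache */
} demoKeyState;

/* Gate: run the check only when tracking is on, the command left dirty
 * untouched, and the command is in the hash or set group. */
int demoShouldCheckRehash(long long dirty_before, long long dirty_after,
                          demoCommandGroup group, int stats_enabled) {
    if (!stats_enabled) return 0;
    if (dirty_after != dirty_before) return 0; /* signalModifiedKey already ran */
    return group == DEMO_GROUP_HASH || group == DEMO_GROUP_SET;
}

/* Handler: every early return is the cheap, common path. Returns the
 * delta applied to the slot counter (0 when nothing changed). */
int64_t demoHandleRehashOverhead(demoKeyState *key, int64_t *slot_bytes) {
    assert(slot_bytes != NULL);
    if (key == NULL) return 0;                 /* key not found */
    if (!key->hashtable_encoded) return 0;     /* no rehashing possible */
    if (key->logical_size == key->cached_size) return 0;
    int64_t delta = (int64_t)key->logical_size - (int64_t)key->cached_size;
    *slot_bytes += delta;
    key->cached_size = key->logical_size;      /* refresh the baseline */
    return delta;
}
```

For example, a read that frees one 64-byte bucket would show up as a delta of -64 on the first check and 0 on the next, since the cache entry is refreshed.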

Active defrag support

The key_mem_cache hashtable allocates many small objects (keySizeCacheEntry structs + sds key copies) that can cause memory fragmentation. A defrag stage (defragStageKeyMemCache) is registered per database to scan and defragment these allocations. The scan is time-bounded to avoid latency spikes.

Struct changes

slotStat in cluster_legacy.h gains one field:

int64_t memory_logical_bytes;

serverDb in server.h gains one field:

hashtable *key_mem_cache;

Each cache entry stores an sds key copy + one size_t field (~24 bytes per key).

DEBUG SLOT-VERIFY-MEMORY

New debug subcommand that independently walks all keys in a slot using computeObjectExpectedSize() -- an O(n) walk that does NOT use objectLogicalSize or any tracking field, only pre-existing APIs (sdslen, sdsHdrSize, lpBytes, intsetBlobLen, hashtableMemUsage, raxComputeLogicalSize, etc.). Compares the walk result against slot_stats and returns OK or an error with mismatch details.
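
The comparison at the core of the subcommand can be sketched as follows, assuming the independent walk has already produced one expected size per key. The function name and signature are hypothetical; the real command walks the slot's keys with computeObjectExpectedSize().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums independently computed per-key sizes and compares the total with
 * the incrementally tracked slot counter. Returns 1 on a match, 0 on a
 * mismatch; *expected_out (if non-NULL) receives the walked total so a
 * caller can report mismatch details. */
int demoVerifySlotMemory(const size_t *key_sizes, size_t nkeys,
                         int64_t tracked, int64_t *expected_out) {
    assert(key_sizes != NULL || nkeys == 0);
    int64_t expected = 0;
    for (size_t i = 0; i < nkeys; i++) expected += (int64_t)key_sizes[i];
    if (expected_out) *expected_out = expected;
    return expected == tracked;
}
```

The check is only meaningful because the walk shares no state with the tracked counter: a bug in the tracking fields cannot cancel itself out in the expected value.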

Tcl integration tests

27 tests added to tests/unit/cluster/slot-stats.tcl:

  • String: SET, overwrite, integer encoding, DEL
  • Hash: in-place growth (50 fields with verification after each HSET), field expiry (partial and full key deletion)
  • Set: SADD 200 members then SREM 100, verifying decrease
  • List: RPUSH 100 items then LPOP 50, verifying decrease
  • Stream: XADD 50 entries across multiple rax nodes then XTRIM, verifying decrease
  • Cross-slot independence: two keys in different slots, DEL one, other unchanged
  • Same-slot accumulation: two keys with hash tags in same slot
  • Mixed types in same slot: string + hash + set with hash tags
  • FLUSHALL: resets all slot memory to zero
  • FLUSHDB SYNC: resets cache and re-tracks new keys correctly
  • Key expiry: SET with PX, wait, verify memory drops to zero
  • Lazy expiry + write: SET with PX, disable active expiry, overwrite after expiry -- verifies no double-counting
  • MULTI/EXEC: transaction with string + 50-field hash + 50-member set + 50-item list
  • Eviction: fill slot, set tight maxmemory with allkeys-lru, verify after eviction
  • Hash field expiry (partial): HSETEX with TTL, verify memory decreases after fields expire
  • Hash field expiry (empty key): HSETEX where all fields expire, verify key deleted and memory is zero
  • RDB reload: create mixed types, DEBUG RELOAD, verify slot stats match
  • AOF reload: create keys, BGREWRITEAOF, add more keys, DEBUG LOADAOF -- verifies correctness
  • Mismatch detection: CONFIG RESETSTAT corrupts stats, DEBUG SLOT-VERIFY-MEMORY catches it
  • ORDERBY memory-logical-bytes: two keys with different sizes, verify descending order
  • Hash rehash overhead repro: 100 HSET + 50 HDEL + 100 more HSET with verify after each operation
  • Hash+set rehash overhead repro: 300 rounds of interleaved hash/set writes with verify after each operation
  • Random fuzzer: 300 rounds x 200 random operations across 4 types and 2 slots with verify after each operation

liorsve added 2 commits April 9, 2026 13:20
…ed branch)

Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
…on with objectComputeSize test

Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
@liorsve liorsve force-pushed the per-slot-memory-per-key-cache branch 3 times, most recently from b7fd8e1 to a5419a6 Compare April 13, 2026 13:53
…quicklist tests failing on purpose

Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
@liorsve liorsve force-pushed the per-slot-memory-per-key-cache branch 2 times, most recently from 6ff0560 to a66fcfd Compare April 14, 2026 07:50
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
@liorsve liorsve force-pushed the per-slot-memory-per-key-cache branch from a66fcfd to adc1c2b Compare April 14, 2026 08:54
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>