Per-slot memory tracking in CLUSTER SLOT-STATS - no key cache#10

Draft
liorsve wants to merge 6 commits into unstable from per-slot-memory-aggregation

@liorsve liorsve commented Apr 12, 2026

Commit history

The first two commits are squashed versions of prior tracking PRs that this work builds on:

Summary

Adds per-slot memory tracking to CLUSTER SLOT-STATS, reporting one new metric:

  • memory-logical-bytes: combined user data + container overhead (field-value pairs, set members, listpack bytes, stream entries, hashtable bucket arrays, rax node overhead, vset containers, quicklist node structs)

The metric supports ORDERBY for sorting slots by memory usage.

New function: objectLogicalSize()

O(1) function in object.c that reads incrementally-maintained tracked fields to compute logical size per type/encoding. Returns a single size_t combining data and overhead:

| Type | Encoding | What's included |
|------|----------|-----------------|
| STRING | RAW/EMBSTR | sdsReqSize (header + content + null) |
| STRING | INT | 0 (value embedded in robj pointer) |
| LIST | QUICKLIST | tracked_data_bytes + sizeof(quicklist) + len * sizeof(quicklistNode) |
| LIST | LISTPACK | lpBytes |
| SET | HASHTABLE | hashtableTrackedDataBytes + hashtableMemUsage |
| SET | INTSET | intsetBlobLen |
| SET | LISTPACK | lpBytes |
| HASH | HASHTABLE | hashtableTrackedDataBytes + hashtableMemUsage + vsetLogicalSize |
| HASH | LISTPACK | lpBytes |
| ZSET | LISTPACK | lpBytes |
| ZSET | SKIPLIST | 0 (no O(1) tracking yet) |
| STREAM | STREAM | tracked_data_bytes + tracked_overhead + sizeof(stream) |
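
To illustrate how the table above reads as code, here is a minimal, self-contained sketch of an O(1) size function that dispatches on type and encoding. It covers only the STRING and LIST rows. Every `toy_` name is a hypothetical stand-in invented for this example; the real robj, quicklist, and listpack structures in this PR are different and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real robj, quicklist and listpack types differ. */
typedef enum { T_STRING, T_LIST } toy_type;
typedef enum { E_RAW, E_INT, E_QUICKLIST, E_LISTPACK } toy_encoding;

typedef struct { void *prev, *next, *entry; size_t sz; } toy_quicklist_node;

typedef struct {
    toy_quicklist_node *head, *tail;
    size_t len;                /* number of nodes */
    size_t tracked_data_bytes; /* assumed to be updated on every push/pop */
} toy_quicklist;

typedef struct {
    toy_type type;
    toy_encoding encoding;
    size_t sds_req_size; /* STRING RAW/EMBSTR: header + content + null */
    size_t lp_bytes;     /* LISTPACK: total size of the listpack blob */
    toy_quicklist ql;    /* LIST QUICKLIST */
} toy_object;

/* O(1): reads maintained fields only, never walks the elements. */
size_t toy_logical_size(const toy_object *o) {
    assert(o != NULL);
    switch (o->type) {
    case T_STRING:
        if (o->encoding == E_INT) return 0; /* value lives in the pointer */
        return o->sds_req_size;
    case T_LIST:
        if (o->encoding == E_LISTPACK) return o->lp_bytes;
        return o->ql.tracked_data_bytes + sizeof(toy_quicklist) +
               o->ql.len * sizeof(toy_quicklist_node);
    }
    return 0; /* unknown type: report nothing rather than guess */
}
```

The point of the sketch is that the cost does not depend on the number of elements: a quicklist with three nodes and one with three million nodes both cost one addition and one multiplication.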

Key modification paths and how each is tracked

All data and overhead modifications flow through write commands or explicit out-of-call mutation points. Each is covered by a before/after snapshot or a direct subtraction. The one exception is incremental rehashing during reads, which gets a dedicated lightweight check (see details below the table):

| Path | Where handled | How |
|------|---------------|-----|
| Normal commands (SET, HSET, SADD, RPUSH, XADD, DEL, etc.) | call() in server.c | Before/after key size snapshot via getKeysFromCommand around cmd->proc(c) |
| In-place mutations (HSET adding field, SREM, LPOP, XTRIM) | call() in server.c | Before/after key size snapshot via getKeysFromCommand around cmd->proc(c) |
| Key eviction (maxmemory pressure) | performEvictions() in evict.c | Subtract objectLogicalSize before dbGenericDelete |
| Key expiry (active expire cycle) | deleteExpiredKeyAndPropagateWithDictIndex() in db.c | Subtract objectLogicalSize before dbGenericDelete; skipped during lazy expiry (executing_command flag) to avoid double-count with call() hooks |
| Key expiry (lazy expire during command) | call() in server.c | Handled by before/after hooks; explicit expiry hook skips via executing_command flag to avoid double-count |
| Hash field expiry (partial -- some fields expire) | dbReclaimExpiredFields() in db.c | Before/after objectLogicalSize around hashTypeDeleteExpiredFields |
| Hash field expiry (all fields expire -- key deleted) | dbReclaimExpiredFields() in db.c | Subtract remaining objectLogicalSize before dbDelete |
| FLUSHALL / FLUSHDB | signalFlushedDb() in db.c | Zero memory_logical_bytes for all slots |
| RDB loading | dbAddRDBLoad() in db.c | Add objectLogicalSize to slot stats after insert |
| AOF RESP loading | loadAppendOnlyFiles() in aof.c | Full recount of all slots after AOF load completes (RESP commands bypass call()) |
| Slot ownership changes | clusterSlotStatReset() in cluster_slot_stats.c | memset zeros entire slotStat including new field |
| Incremental rehash during hash/set reads | call() in server.c | Lightweight before/after check on argv[1] for COMMAND_GROUP_HASH/SET when dirty unchanged (see below) |
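
The two accounting patterns in the table, a before/after snapshot around the command and a direct subtraction before an out-of-call delete, can be sketched as follows. This is a simplified model under stated assumptions, not the code in this PR: `toy_slot_bytes`, `toy_key`, `toy_call`, and the two example commands are hypothetical names, and a key is reduced to a single size field where 0 means the key is absent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_SLOTS 16384

/* Hypothetical per-slot counter, mirroring the memory_logical_bytes field. */
static int64_t toy_slot_bytes[TOY_NUM_SLOTS];

typedef struct { size_t logical_size; } toy_key; /* 0 means "key absent" */
typedef void (*toy_cmd_proc)(toy_key *key);

/* Runs a command and charges the size change of its key to the slot. */
void toy_call(int slot, toy_key *key, toy_cmd_proc proc) {
    assert(slot >= 0 && slot < TOY_NUM_SLOTS);
    size_t before = key->logical_size; /* snapshot before the command */
    proc(key);
    size_t after = key->logical_size;  /* snapshot after the command */
    /* Signed arithmetic: deletes and shrinks produce negative deltas. */
    toy_slot_bytes[slot] += (int64_t)after - (int64_t)before;
}

/* Out-of-call removal (eviction, active expiry): subtract, then delete. */
void toy_remove_outside_call(int slot, toy_key *key) {
    assert(slot >= 0 && slot < TOY_NUM_SLOTS);
    toy_slot_bytes[slot] -= (int64_t)key->logical_size;
    key->logical_size = 0;
}

/* Two example commands used to exercise the hooks. */
void toy_cmd_grow(toy_key *key) { key->logical_size += 40; }
void toy_cmd_del(toy_key *key) { key->logical_size = 0; }
```

Because the snapshot approach charges only the difference, creation, in-place growth, and deletion all go through the same code path. The subtraction must happen before the delete, since the size is no longer readable once the value is freed.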

Incremental rehash overhead tracking

The one exception to the "all changes go through writes" rule: hashtable bucket overhead can change during read commands.

Hash and set values using hashtable encoding undergo incremental rehashing. Every findBucket() call -- including those from read commands like HGET and SISMEMBER -- migrates one bucket from the old table to the new table. This changes hashtableMemUsage() (part of objectLogicalSize) without any data modification and without going through the write tracking path.

Each rehash step can change overhead by sizeof(bucket) (64 bytes) as child buckets are freed. When rehashing completes, the old table is freed entirely -- overhead can drop by hundreds of bytes in a single step.

Solution: A lightweight check in call() around cmd->proc(c):

  1. Gate: Only fires when dirty didn't change (otherwise the write path already handled it), the command belongs to COMMAND_GROUP_HASH or COMMAND_GROUP_SET, and slot stats are enabled (clusterSlotStatsEnabled).
  2. Before cmd->proc: clusterSlotStatsSnapshotRehashOverhead(c) looks up argv[1], checks if it's hashtable-encoded AND mid-rehash. If not, returns 0 (skip). If yes, snapshots current objectLogicalSize.
  3. After cmd->proc: clusterSlotStatsApplyRehashOverhead(c) re-reads objectLogicalSize and applies the delta to slot_stats.

Only argv[1] is checked because all hash/set read commands are single-key. The only multi-key command that touches a second hash/set value is SMOVE, which is a write command -- already covered by the full before/after write path.

Cost: For non-hash/set commands: zero (group check). For hash/set reads where the HT isn't rehashing: one dbFind + encoding check + hashtableIsRehashing -- all return early. The actual delta application only runs when overhead genuinely changed (rare, only during active rehashing).

Behavior without this check: the drift self-corrects on the next successful write command to that key, which refreshes the before/after baseline.
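
The gate, snapshot, and apply steps described in this section can be sketched as a pair of small functions. This is an illustrative model only: the `toy_` types and functions are hypothetical, the boolean fields stand in for the command-group, encoding, and rehashing checks, and the real clusterSlotStatsSnapshotRehashOverhead and clusterSlotStatsApplyRehashOverhead take a client and perform a key lookup that is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool hashtable_encoded;
    bool rehashing;
    size_t logical_size;
} toy_value;

typedef struct {
    bool is_hash_or_set_cmd; /* stands in for the command-group check */
    bool stats_enabled;      /* stands in for clusterSlotStatsEnabled */
    size_t slot_mem_before;  /* per-command state, 0 means "nothing to do" */
} toy_client;

/* Before the command: snapshot only when the value is mid-rehash. */
void toy_snapshot_rehash(toy_client *c, const toy_value *v) {
    assert(c != NULL);
    c->slot_mem_before = 0;
    if (!c->stats_enabled || !c->is_hash_or_set_cmd) return;
    if (v == NULL || !v->hashtable_encoded || !v->rehashing) return;
    c->slot_mem_before = v->logical_size;
}

/* After the command: apply the delta only if the write path did not run. */
void toy_apply_rehash(toy_client *c, const toy_value *v,
                      long long dirty_before, long long dirty_after,
                      int64_t *slot_bytes) {
    assert(c != NULL && slot_bytes != NULL);
    size_t before = c->slot_mem_before;
    c->slot_mem_before = 0;                  /* always clear per-command state */
    if (before == 0 || v == NULL) return;    /* no snapshot was taken */
    if (dirty_after != dirty_before) return; /* write hooks already counted it */
    *slot_bytes += (int64_t)v->logical_size - (int64_t)before;
}
```

For example, if a read steps the rehash and one 64-byte bucket is freed, a value that was 1000 bytes before the command and 936 bytes after it lowers the slot counter by 64.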

Struct changes

slotStat in cluster_legacy.h gains one field:

int64_t memory_logical_bytes;

client in server.h gains one field for per-command state:

size_t slot_mem_before;

DEBUG SLOT-VERIFY-MEMORY

New debug subcommand that independently walks all keys in a slot using computeObjectExpectedSize() -- an O(n) walk that does NOT use objectLogicalSize or any tracking field, only pre-existing APIs (sdslen, sdsHdrSize, lpBytes, intsetBlobLen, hashtableMemUsage, raxComputeLogicalSize, etc.). Compares the walk result against slot_stats and returns OK or an error with mismatch details.
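
The idea behind the verifier, an independent O(n) recount compared against the incrementally tracked counter, can be sketched as below. This is a toy model, not computeObjectExpectedSize(): the `toy_` names are hypothetical, keys are reduced to arrays of field-value string pairs, and only payload bytes are counted (no headers or container overhead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { const char *field; const char *value; } toy_pair;
typedef struct { const toy_pair *pairs; size_t npairs; } toy_hash_key;

/* O(n) recount: walks every pair instead of trusting a tracked counter. */
size_t toy_expected_size(const toy_hash_key *k) {
    size_t total = 0;
    for (size_t i = 0; i < k->npairs; i++) {
        total += strlen(k->pairs[i].field) + strlen(k->pairs[i].value);
    }
    return total;
}

/* Returns 0 on match, otherwise the signed drift (tracked - expected). */
int64_t toy_verify_slot(const toy_hash_key *keys, size_t nkeys, int64_t tracked) {
    assert(keys != NULL || nkeys == 0);
    int64_t expected = 0;
    for (size_t i = 0; i < nkeys; i++) {
        expected += (int64_t)toy_expected_size(&keys[i]);
    }
    return tracked - expected;
}
```

Keeping the recount free of any tracked field is what makes the comparison meaningful: a bug in the incremental bookkeeping cannot also hide in the value it is checked against.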

Tcl integration tests

26 tests added to tests/unit/cluster/slot-stats.tcl:

  • String: SET, overwrite, integer encoding, DEL
  • Hash: in-place growth (50 fields with verification after each HSET), field expiry (partial and full key deletion)
  • Set: SADD 200 members then SREM 100, verifying decrease
  • List: RPUSH 100 items then LPOP 50, verifying decrease
  • Stream: XADD 50 entries across multiple rax nodes then XTRIM, verifying decrease
  • Cross-slot independence: two keys in different slots, DEL one, other unchanged
  • Same-slot accumulation: two keys with hash tags in same slot
  • Mixed types in same slot: string + hash + set with hash tags
  • FLUSHALL: resets all slot memory to zero
  • Key expiry: SET with PX, wait, verify memory drops to zero
  • Lazy expiry + write: SET with PX, disable active expiry, overwrite after expiry -- verifies no double-counting
  • MULTI/EXEC: transaction with string + 50-field hash + 50-member set + 50-item list
  • Eviction: fill slot, set tight maxmemory with allkeys-lru, verify after eviction
  • Hash field expiry (partial): HSETEX with TTL, verify memory decreases after fields expire
  • Hash field expiry (empty key): HSETEX where all fields expire, verify key deleted and memory is zero
  • RDB reload: create mixed types, DEBUG RELOAD, verify slot stats match
  • AOF reload: create keys, BGREWRITEAOF, add more keys, DEBUG LOADAOF -- verifies post-load recount
  • Mismatch detection: CONFIG RESETSTAT corrupts stats, DEBUG SLOT-VERIFY-MEMORY catches it
  • ORDERBY memory-logical-bytes: two keys with different sizes, verify descending order
  • Hash rehash overhead repro: 100 HSET + 50 HDEL + 100 more HSET with verify after each operation
  • Hash+set rehash overhead repro: 300 rounds of interleaved hash/set writes AND reads (HGET, SISMEMBER) with verify after each
  • Random fuzzer: 300 rounds x 200 random operations across 4 types and 2 slots with verify after each operation

liorsve added 2 commits April 9, 2026 13:20
…ed branch)

Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
…on with objectComputeSize test

Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
@liorsve liorsve force-pushed the per-slot-memory-aggregation branch from 018d38d to 93895d3 Compare April 12, 2026 16:09
@liorsve liorsve changed the title Per-slot memory tracking in CLUSTER SLOT-STATS Per-slot memory tracking in CLUSTER SLOT-STATS - no key cache Apr 13, 2026
@liorsve liorsve force-pushed the per-slot-memory-aggregation branch 2 times, most recently from 5d8716d to 09a3a14 Compare April 13, 2026 08:45
@liorsve liorsve force-pushed the per-slot-memory-aggregation branch 3 times, most recently from a41b6d7 to f3873f2 Compare April 13, 2026 14:10
…on purpose

Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
@liorsve liorsve force-pushed the per-slot-memory-aggregation branch from f3873f2 to f1193de Compare April 13, 2026 14:20
liorsve added 3 commits April 14, 2026 08:37
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>
Signed-off-by: Lior Sventitzky <liorsve@amazon.com>