Per-slot memory tracking via per-key cache + signalModifiedKey by liorsve · Pull Request #11 · liorsve/valkey

liorsve · 2026-04-13T09:05:53Z

Note on commit history

The first two commits are squashed versions of the earlier tracking PRs that this work builds on:

Squashed hashtable + rax + stream + vset O(1) memory tracking: from PR #9, "Combined: hashtable + rax memory tracking with vset integration" (combined-hashtable-rax-tracking)
Added logical size tracking: from PR #4, "Track logical quicklist memory incrementally via lpBytes + compressed size" (quicklist-logical-size-tracking)

Summary

Adds per-slot memory tracking to CLUSTER SLOT-STATS using a per-key size cache with signalModifiedKey as the primary hook for writes, a rehash overhead check in call() for reads, plus dedicated hooks for flush, RDB load, DB lifecycle, and defrag.

One new metric is reported:

memory-logical-bytes: combined user data + container overhead (field-value pairs, set members, listpack bytes, stream entries, hashtable bucket arrays, rax node overhead, vset containers, quicklist node structs)

The metric supports ORDERBY for sorting slots by memory usage.

Comparison with PR #10 (before/after in call)

| | PR #10 (before/after in call) | This PR (per-key cache) |
|---|---|---|
| Tracking hook points | 8 (call writes, call reads, eviction, active expiry, hash field expiry, flush, RDB, AOF recount) | 4 (signalModifiedKey, call reads, flush, RDB) -- eviction, expiry, field expiry, and AOF load are covered automatically via signalModifiedKey |
| Extra lookups per write command | 2N dbFind where N = number of keys in the command (before + after snapshot) | 1 dbFind + 1 cache lookup per key changed via signalModifiedKey |
| Per-key memory overhead | 0 | ~24 bytes per key (sds copy + size_t + struct) |
| Client struct overhead | `slot_mem_before` field (needed to persist before-snapshot across cmd->proc) | None (per-key cache serves as the baseline) |
| Covers eviction/expiry | Needs explicit hooks | Automatic via signalModifiedKey |
| Lazy expiry double-count risk | Needs executing_command guard | No risk -- single hook point |
| DB lifecycle management | None | Cache must be cleared/swapped with DB; defrag support required |
| Rehash overhead on reads | Before/after snapshot on argv[1] | Compares objectLogicalSize against per-key cache |

Approach: per-key cache + signalModifiedKey

A single flat hashtable (db->key_mem_cache) stores the last-known logical_bytes per key. signalModifiedKey is the primary hook for all write mutations:

1. Early return if db->key_mem_cache is NULL (tracking disabled)
2. Look up current value via dbFind -- compute objectLogicalSize
3. Look up cached size from key_mem_cache
4. Delta = current - cached
5. Update slot_stats[slot].memory_logical_bytes and refresh the cache entry

For new keys: no cache entry exists, cached = 0, delta = +current. A new cache entry is inserted.
For deleted keys: dbFind returns NULL, objectLogicalSize returns 0, delta = -cached. The cache entry is removed.

The cache uses a single hashtable (not a kvstore) since we only do point lookups by key name -- per-slot partitioning adds no value here.
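
The delta computation described above can be sketched in isolation. This is a toy model, not the code in this PR: a fixed-size array stands in for db->key_mem_cache, the caller passes the current logical size directly instead of going through dbFind and objectLogicalSize, and every `toy*` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins (hypothetical): a small array plays the role of
 * db->key_mem_cache, and toy_slot_bytes plays the role of
 * slot_stats[slot].memory_logical_bytes. */
#define TOY_CACHE_CAP 16
#define TOY_SLOTS 16384

typedef struct {
    char key[32];
    size_t cached_bytes; /* last-known logical size of the key */
    int used;
} toyCacheEntry;

static toyCacheEntry toy_cache[TOY_CACHE_CAP];
static int64_t toy_slot_bytes[TOY_SLOTS];

static toyCacheEntry *toyCacheFind(const char *key) {
    for (int i = 0; i < TOY_CACHE_CAP; i++)
        if (toy_cache[i].used && strcmp(toy_cache[i].key, key) == 0) return &toy_cache[i];
    return NULL;
}

static toyCacheEntry *toyCacheInsert(const char *key) {
    for (int i = 0; i < TOY_CACHE_CAP; i++) {
        if (toy_cache[i].used) continue;
        toy_cache[i].used = 1;
        strncpy(toy_cache[i].key, key, sizeof(toy_cache[i].key) - 1);
        toy_cache[i].key[sizeof(toy_cache[i].key) - 1] = '\0';
        toy_cache[i].cached_bytes = 0;
        return &toy_cache[i];
    }
    return NULL; /* toy cache is full; a real hashtable would grow */
}

/* exists == 0 models dbFind returning NULL (the key was deleted), in
 * which case the current size is treated as 0. */
static void toySignalModifiedKey(const char *key, int slot, int exists, size_t current_bytes) {
    size_t current = exists ? current_bytes : 0;
    toyCacheEntry *e = toyCacheFind(key);
    size_t cached = e ? e->cached_bytes : 0; /* new key: cached = 0 */

    /* Delta = current - cached, applied to the slot counter. */
    toy_slot_bytes[slot] += (int64_t)current - (int64_t)cached;

    if (!exists) {
        if (e) e->used = 0; /* deleted key: drop the cache entry */
    } else {
        if (!e) e = toyCacheInsert(key);
        if (e) e->cached_bytes = current; /* refresh the baseline */
    }
}
```

Calling `toySignalModifiedKey("k", 7, 1, 100)`, then again with 140, then with `exists = 0` moves `toy_slot_bytes[7]` through 100, 140, and back to 0, which mirrors the new-key, in-place-growth, and delete cases listed above.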

New function: objectLogicalSize()

O(1) function in object.c that reads incrementally-maintained tracked fields to compute logical size per type/encoding. Returns a single size_t combining data and overhead:

| Type | Encoding | What's included |
|---|---|---|
| STRING | RAW/EMBSTR | sdsReqSize (header + content + null) |
| STRING | INT | 0 (value embedded in robj pointer) |
| LIST | QUICKLIST | tracked_data_bytes + sizeof(quicklist) + len * sizeof(quicklistNode) |
| LIST | LISTPACK | lpBytes |
| SET | HASHTABLE | hashtableTrackedDataBytes + hashtableMemUsage |
| SET | INTSET | intsetBlobLen |
| SET | LISTPACK | lpBytes |
| HASH | HASHTABLE | hashtableTrackedDataBytes + hashtableMemUsage + vsetLogicalSize |
| HASH | LISTPACK | lpBytes |
| ZSET | LISTPACK | lpBytes |
| ZSET | SKIPLIST | 0 (no O(1) tracking yet) |
| STREAM | STREAM | tracked_data_bytes + tracked_overhead + sizeof(stream) |
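
To illustrate the shape of such a dispatch, here is a minimal sketch covering three rows of the table (STRING, LIST, ZSET). It does not use the real robj or quicklist types: `toyObject` and its fields are hypothetical stand-ins that hold the already-tracked numbers, so each branch is plain O(1) arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the object type and encoding constants. */
typedef enum { TOY_STRING, TOY_LIST, TOY_ZSET } toyType;
typedef enum {
    TOY_ENC_RAW,
    TOY_ENC_INT,
    TOY_ENC_QUICKLIST,
    TOY_ENC_LISTPACK,
    TOY_ENC_SKIPLIST
} toyEncoding;

/* Each field holds a value the real code would read from a tracked
 * counter or an existing accessor. */
typedef struct {
    toyType type;
    toyEncoding encoding;
    size_t sds_bytes;          /* stands in for sdsReqSize */
    size_t lp_bytes;           /* stands in for lpBytes */
    size_t tracked_data_bytes; /* quicklist tracked_data_bytes */
    size_t node_count;         /* quicklist len */
    size_t container_bytes;    /* stands in for sizeof(quicklist) */
    size_t node_struct_bytes;  /* stands in for sizeof(quicklistNode) */
} toyObject;

/* Returns data + container overhead in O(1); a missing object is 0 bytes. */
size_t toyObjectLogicalSize(const toyObject *o) {
    if (o == NULL) return 0;
    switch (o->type) {
    case TOY_STRING:
        /* INT-encoded strings live in the pointer itself. */
        return o->encoding == TOY_ENC_INT ? 0 : o->sds_bytes;
    case TOY_LIST:
        if (o->encoding == TOY_ENC_LISTPACK) return o->lp_bytes;
        return o->tracked_data_bytes + o->container_bytes +
               o->node_count * o->node_struct_bytes;
    case TOY_ZSET:
        /* Skiplist encoding has no O(1) tracking yet, so it reports 0. */
        return o->encoding == TOY_ENC_LISTPACK ? o->lp_bytes : 0;
    }
    return 0;
}
```

With `tracked_data_bytes = 1000`, three nodes of 32 bytes each, and a 40-byte container, the quicklist branch returns 1000 + 40 + 3 * 32 = 1136. The sizes here are made-up inputs chosen for the arithmetic, not the actual struct sizes.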

Key modification paths and how each is tracked

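Per-key mutations go through signalModifiedKey, which is called by every write command, eviction, expiry, and hash field expiry path. Whole-database operations (flush, RDB load, DB swap, temp DB discard) and slot ownership changes have dedicated hooks. Incremental rehashing during reads is the one per-key change that bypasses signalModifiedKey, so it gets its own check (see details below the table):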

| Path | Where handled | How |
|---|---|---|
| All commands (SET, HSET, SADD, DEL, MULTI/EXEC, etc.) | signalModifiedKey() in db.c | Diff current vs cached, update slot stats |
| In-place mutations (HSET adding field, SREM, LPOP, XTRIM) | signalModifiedKey() in db.c | Diff current vs cached, update slot stats |
| Key eviction (maxmemory pressure) | signalModifiedKey() in db.c | Automatic -- eviction calls signalModifiedKey |
| Key expiry (active + lazy) | signalModifiedKey() in db.c | Automatic -- expiry calls signalModifiedKey |
| Hash field expiry (partial + empty key) | signalModifiedKey() in db.c | Automatic -- field expiry calls signalModifiedKey |
| FLUSHALL / FLUSHDB (async) | signalFlushedDb() in db.c + emptyDbAsync() in lazyfree.c | Zero slot stats; cache hashtable released and recreated |
| FLUSHALL / FLUSHDB (sync) | signalFlushedDb() in db.c + emptyDbStructure() in db.c | Zero slot stats; cache hashtable emptied |
| RDB loading | dbAddRDBLoad() in db.c | Add to slot stats + insert cache entry |
| AOF RESP loading | signalModifiedKey() via cmd->proc() | Automatic -- commands during AOF replay call signalModifiedKey |
| SWAPDB | dbSwapDatabases() in db.c | key_mem_cache swapped along with keys/expires |
| RDB reload (swapMainDbWithTempDb) | swapMainDbWithTempDb() in db.c | key_mem_cache swapped along with keys/expires |
| Temp DB discard | discardTempDb() in db.c | key_mem_cache released before freeing DB |
| Slot ownership changes | clusterSlotStatReset() | memset zeros entire slotStat |
| Incremental rehash during hash/set reads | call() in server.c | check on argv[1] via per-key cache (see below) |

Incremental rehash overhead tracking

The one exception to the "all changes go through signalModifiedKey" rule: hashtable bucket overhead can change during read commands.

Hash and set values using hashtable encoding undergo incremental rehashing. Every findBucket() call -- including those from read commands like HGET and SISMEMBER, or failed writes like HDEL on a non-existent field -- migrates one bucket from the old table to the new table. This changes hashtableMemUsage() (part of objectLogicalSize) without any data modification and without calling signalModifiedKey.

Each rehash step can change overhead by sizeof(bucket) (64 bytes) as child buckets are freed. When rehashing completes, the old table is freed entirely -- overhead can drop by hundreds of bytes in a single step.

Solution: clusterSlotStatsHandleRehashOverhead(c) is called from call() after cmd->proc(c):

Gate in call(): Only fires when dirty didn't change (signalModifiedKey already handled it otherwise), the command belongs to COMMAND_GROUP_HASH or COMMAND_GROUP_SET, and clusterSlotStatsEnabled.
Inside the function: Looks up argv[1] in the kvstore. Bails if the value isn't hashtable-encoded. Reads objectLogicalSize, compares against the per-key cache entry, and applies the delta if they differ.

Only argv[1] is checked because all hash/set read commands are single-key. The only multi-key command that touches a second hash/set value is SMOVE, which is a write command -- already covered by signalModifiedKey.

Cost: For non-hash/set commands: zero beyond the group check. For hash/set reads where nothing changed: one kvstoreHashtableFind + encoding check + objectLogicalSize + cache comparison, after which the function returns early. The delta is applied only when the size actually changed, which is rare and happens only during active rehashing.

Self-correction: even without this check, the drift would correct itself on the next successful write command to that key, which refreshes the per-key cache via signalModifiedKey.
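
The gate and the comparison can be sketched as two small functions. This is an illustrative model only: `toyGroup`, `toyKeyState`, and both function names are hypothetical, and the key state is passed in directly rather than looked up from argv[1].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the command group check. */
typedef enum { TOY_GROUP_STRING, TOY_GROUP_HASH, TOY_GROUP_SET } toyGroup;

/* Hypothetical view of one key: its encoding, what objectLogicalSize
 * would return right now, and what the per-key cache last recorded. */
typedef struct {
    int hashtable_encoded;
    size_t logical_bytes;
    size_t cached_bytes;
} toyKeyState;

/* Models the gate in call(): run the check only when tracking is on,
 * dirty did not change (otherwise signalModifiedKey already handled it),
 * and the command is in the hash or set group. */
int toyShouldCheckRehash(long long dirty_before, long long dirty_after,
                         toyGroup group, int slot_stats_enabled) {
    if (!slot_stats_enabled) return 0;
    if (dirty_after != dirty_before) return 0;
    return group == TOY_GROUP_HASH || group == TOY_GROUP_SET;
}

/* Models the body of the check: bail early on the common cases and
 * apply a delta only when the size differs from the cache.
 * Returns the delta that was applied (0 if none). */
int64_t toyHandleRehashOverhead(toyKeyState *k, int64_t *slot_bytes) {
    if (k == NULL || !k->hashtable_encoded) return 0;
    if (k->logical_bytes == k->cached_bytes) return 0;
    int64_t delta = (int64_t)k->logical_bytes - (int64_t)k->cached_bytes;
    *slot_bytes += delta;
    k->cached_bytes = k->logical_bytes; /* refresh the baseline */
    return delta;
}
```

For a key cached at 1000 bytes whose size dropped to 936 after one bucket was freed, the first call applies a delta of -64 and the second call returns 0 because the cache now matches.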

Active defrag support

The key_mem_cache hashtable allocates many small objects (keySizeCacheEntry structs + sds key copies) that can cause memory fragmentation. A defrag stage (defragStageKeyMemCache) is registered per database to scan and defragment these allocations. The scan is time-bounded to avoid latency spikes.

Struct changes

slotStat in cluster_legacy.h gains one field:

int64_t memory_logical_bytes;

serverDb in server.h gains one field:

hashtable *key_mem_cache;

Each cache entry stores an sds key copy + one size_t field (~24 bytes per key).

DEBUG SLOT-VERIFY-MEMORY

New debug subcommand that independently walks all keys in a slot using computeObjectExpectedSize() -- an O(n) walk that does NOT use objectLogicalSize or any tracking field, only pre-existing APIs (sdslen, sdsHdrSize, lpBytes, intsetBlobLen, hashtableMemUsage, raxComputeLogicalSize, etc.). Compares the walk result against slot_stats and returns OK or an error with mismatch details.
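
The verification idea, summing an independently computed size over every key in a slot and comparing it with the tracked counter, can be sketched as follows. The `toyKey` struct, the function name, and the error message format are assumptions made for illustration; the actual reply format of the debug subcommand is not shown in this description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical key record: the slot it hashes to and the size an
 * independent O(n) walk computed for it. */
typedef struct {
    int slot;
    size_t expected_bytes;
} toyKey;

/* Returns 1 if the walked total matches the tracked counter.
 * On mismatch returns 0 and writes the details into err. */
int toyVerifySlotMemory(const toyKey *keys, size_t nkeys, int slot,
                        int64_t tracked, char *err, size_t errlen) {
    int64_t walked = 0;
    for (size_t i = 0; i < nkeys; i++)
        if (keys[i].slot == slot) walked += (int64_t)keys[i].expected_bytes;

    if (walked == tracked) return 1;
    snprintf(err, errlen, "slot %d: tracked=%lld walked=%lld", slot,
             (long long)tracked, (long long)walked);
    return 0;
}
```

The value of the check comes from the two numbers being computed by unrelated code paths, so a bug in the incremental tracking cannot hide itself in the verifier.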

Tcl integration tests

27 tests added to tests/unit/cluster/slot-stats.tcl:

String: SET, overwrite, integer encoding, DEL
Hash: in-place growth (50 fields with verification after each HSET), field expiry (partial and full key deletion)
Set: SADD 200 members then SREM 100, verifying decrease
List: RPUSH 100 items then LPOP 50, verifying decrease
Stream: XADD 50 entries across multiple rax nodes then XTRIM, verifying decrease
Cross-slot independence: two keys in different slots, DEL one, other unchanged
Same-slot accumulation: two keys with hash tags in same slot
Mixed types in same slot: string + hash + set with hash tags
FLUSHALL: resets all slot memory to zero
FLUSHDB SYNC: resets cache and re-tracks new keys correctly
Key expiry: SET with PX, wait, verify memory drops to zero
Lazy expiry + write: SET with PX, disable active expiry, overwrite after expiry -- verifies no double-counting
MULTI/EXEC: transaction with string + 50-field hash + 50-member set + 50-item list
Eviction: fill slot, set tight maxmemory with allkeys-lru, verify after eviction
Hash field expiry (partial): HSETEX with TTL, verify memory decreases after fields expire
Hash field expiry (empty key): HSETEX where all fields expire, verify key deleted and memory is zero
RDB reload: create mixed types, DEBUG RELOAD, verify slot stats match
AOF reload: create keys, BGREWRITEAOF, add more keys, DEBUG LOADAOF -- verifies correctness
Mismatch detection: CONFIG RESETSTAT corrupts stats, DEBUG SLOT-VERIFY-MEMORY catches it
ORDERBY memory-logical-bytes: two keys with different sizes, verify descending order
Hash rehash overhead repro: 100 HSET + 50 HDEL + 100 more HSET with verify after each operation
Hash+set rehash overhead repro: 300 rounds of interleaved hash/set writes with verify after each operation
Random fuzzer: 300 rounds x 200 random operations across 4 types and 2 slots with verify after each operation

liorsve added 2 commits April 9, 2026 13:20

Squashed hashtable + rax + stream + vset O(1) memory tracking (combin…

85c0f0c

…ed branch) Signed-off-by: Lior Sventitzky <liorsve@amazon.com>

added logical size tracking, added new correctness tests and comparis…

14448a2

…on with objectComputeSize test Signed-off-by: Lior Sventitzky <liorsve@amazon.com>

github-actions Bot assigned liorsve Apr 13, 2026

liorsve force-pushed the per-slot-memory-per-key-cache branch 3 times, most recently from b7fd8e1 to a5419a6 Compare April 13, 2026 13:53

added mem kvstore + 3 hooks at modifiedKey, flush and rdb & disabled …

28ab3cb

…quicklist tests failing on purpose Signed-off-by: Lior Sventitzky <liorsve@amazon.com>

liorsve force-pushed the per-slot-memory-per-key-cache branch 2 times, most recently from 6ff0560 to a66fcfd Compare April 14, 2026 07:50

fixed rehash not passing through signalmodifiedkey bug

adc1c2b

Signed-off-by: Lior Sventitzky <liorsve@amazon.com>

liorsve force-pushed the per-slot-memory-per-key-cache branch from a66fcfd to adc1c2b Compare April 14, 2026 08:54

unified data and overhead stats to 1

48fe745

Signed-off-by: Lior Sventitzky <liorsve@amazon.com>

liorsve wants to merge 5 commits into unstable from per-slot-memory-per-key-cache
