Per-slot memory tracking via per-key cache + signalModifiedKey #11
Draft
liorsve wants to merge 5 commits into
Note on commit history
The first two commits are squashed versions of prior tracking PRs that this work builds on:
- Squashed hashtable + rax + stream + vset O(1) memory tracking -- from PR "Combined: hashtable + rax memory tracking with vset integration" #9 (combined-hashtable-rax-tracking)
- added logical size tracking -- from PR "Track logical quicklist memory incrementally via lpBytes + compressed size" #4 (quicklist-logical-size-tracking)

Summary
Adds per-slot memory tracking to `CLUSTER SLOT-STATS` using a per-key size cache, with signalModifiedKey as the primary hook for writes, a rehash overhead check in `call()` for reads, and dedicated hooks for flush, RDB load, DB lifecycle, and defrag.

One new metric is reported:
- `memory-logical-bytes`: combined user data + container overhead (field-value pairs, set members, listpack bytes, stream entries, hashtable bucket arrays, rax node overhead, vset containers, quicklist node structs)

The metric supports `ORDERBY` for sorting slots by memory usage.

Comparison with PR #10 (before/after in call)
`slot_mem_before` field (needed to persist before-snapshot across cmd->proc)

Approach: per-key cache + signalModifiedKey
A single flat hashtable (`db->key_mem_cache`) stores the last-known `logical_bytes` per key. `signalModifiedKey` is the primary hook for all write mutations:
- Return early if `db->key_mem_cache` is NULL (tracking disabled)
- Look up the key via `dbFind` -- compute `objectLogicalSize`
- Look up the cached size in `key_mem_cache`
- Apply the delta to `slot_stats[slot].memory_logical_bytes` and refresh the cache entry
- If the key was deleted: `dbFind` returns NULL, `objectLogicalSize` returns 0, delta = -cached. The cache entry is removed.

The cache uses a single `hashtable` (not a kvstore) since we only do point lookups by key name -- per-slot partitioning adds no value here.

New function: objectLogicalSize()
O(1) function in `object.c` that reads incrementally-maintained tracked fields to compute logical size per type/encoding. Returns a single `size_t` combining data and overhead:

Key modification paths and how each is tracked
All modifications go through `signalModifiedKey`, which is called by every write command, eviction, expiry, and hash field expiry path. The one exception is incremental rehashing during reads, which gets a dedicated check (see details below the table):

Incremental rehash overhead tracking
The one exception to the "all changes go through signalModifiedKey" rule: hashtable bucket overhead can change during read commands.
Hash and set values using hashtable encoding undergo incremental rehashing. Every `findBucket()` call -- including those from read commands like HGET and SISMEMBER, or failed writes like HDEL on a non-existent field -- migrates one bucket from the old table to the new table. This changes `hashtableMemUsage()` (part of `objectLogicalSize`) without any data modification and without calling `signalModifiedKey`.

Each rehash step can change overhead by `sizeof(bucket)` (64 bytes) as child buckets are freed. When rehashing completes, the old table is freed entirely -- overhead can drop by hundreds of bytes in a single step.

Solution: `clusterSlotStatsHandleRehashOverhead(c)` is called from `call()` after `cmd->proc(c)`:
- Runs only when `dirty` didn't change (signalModifiedKey already handled it otherwise), the command belongs to `COMMAND_GROUP_HASH` or `COMMAND_GROUP_SET`, and `clusterSlotStatsEnabled` is true.
- Looks up `argv[1]` in the kvstore. Bails if the value isn't hashtable-encoded. Reads `objectLogicalSize`, compares against the per-key cache entry, and applies the delta if they differ.

Only `argv[1]` is checked because all hash/set read commands are single-key. The only multi-key command that touches a second hash/set value is SMOVE, which is a write command -- already covered by signalModifiedKey.

Cost: For non-hash/set commands: zero (group check). For hash/set reads where nothing changed: one `kvstoreHashtableFind` + encoding check + `objectLogicalSize` + cache comparison -- all return early. The actual delta application only runs when the size genuinely changed (rare, only during active rehashing).

Self-correction without this fix: the drift self-corrects on the next successful write command to that key, which refreshes the per-key cache via signalModifiedKey.
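
The delta logic shared by the signalModifiedKey hook and the rehash overhead check can be sketched roughly as follows. This is a toy model, not the PR's code: `Tracker`, `CacheEntry`, `cache_find`, `cache_insert`, and `tracker_sync_key` are hypothetical names, a fixed-size array stands in for the `key_mem_cache` hashtable, and the caller passes in the size that `objectLogicalSize` would report (0 when the key no longer exists).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_SLOTS 16384 /* number of cluster hash slots */
#define CACHE_CAP 64    /* toy capacity; the real cache is a hashtable */

/* Hypothetical stand-in for one key_mem_cache entry. */
typedef struct {
    char key[32];
    size_t logical_bytes; /* last-known logical size of the key */
    int used;
} CacheEntry;

typedef struct {
    CacheEntry entries[CACHE_CAP];
    int64_t slot_bytes[NUM_SLOTS]; /* models slot_stats[slot].memory_logical_bytes */
} Tracker;

static CacheEntry *cache_find(Tracker *t, const char *key) {
    for (int i = 0; i < CACHE_CAP; i++)
        if (t->entries[i].used && strcmp(t->entries[i].key, key) == 0)
            return &t->entries[i];
    return NULL;
}

static CacheEntry *cache_insert(Tracker *t, const char *key) {
    for (int i = 0; i < CACHE_CAP; i++) {
        CacheEntry *e = &t->entries[i];
        if (e->used) continue;
        e->used = 1;
        strncpy(e->key, key, sizeof(e->key) - 1);
        e->key[sizeof(e->key) - 1] = '\0';
        e->logical_bytes = 0;
        return e;
    }
    return NULL; /* toy cache is full */
}

/* Reconcile one key against the cache and return the delta applied to its
 * slot. current_size is the key's current logical size, or 0 if deleted. */
static int64_t tracker_sync_key(Tracker *t, int slot, const char *key,
                                size_t current_size) {
    assert(slot >= 0 && slot < NUM_SLOTS);
    CacheEntry *e = cache_find(t, key);
    size_t cached = e ? e->logical_bytes : 0;
    int64_t delta = (int64_t)current_size - (int64_t)cached;
    if (delta == 0) return 0; /* nothing changed: early return */

    if (current_size == 0) {
        e->used = 0; /* key deleted: delta = -cached, drop the entry */
    } else {
        if (!e) e = cache_insert(t, key);
        if (!e) return 0; /* could not cache: skip rather than drift */
        e->logical_bytes = current_size; /* refresh the cache entry */
    }
    t->slot_bytes[slot] += delta;
    return delta;
}
```

In this model, a write that creates key "h" at 200 bytes in slot 5 applies a delta of +200. A later read that triggers a rehash step freeing one 64-byte bucket is reconciled as 136 bytes and applies -64. A repeated check with no change returns 0 before touching the slot counter.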
Active defrag support
The `key_mem_cache` hashtable allocates many small objects (keySizeCacheEntry structs + sds key copies) that can cause memory fragmentation. A defrag stage (defragStageKeyMemCache) is registered per database to scan and defragment these allocations. The scan is time-bounded to avoid latency spikes.

Struct changes
`slotStat` in cluster_legacy.h gains one field:
`serverDb` in server.h gains one field:
Each cache entry stores an sds key copy + one size_t field (~24 bytes per key).
DEBUG SLOT-VERIFY-MEMORY
New debug subcommand that independently walks all keys in a slot using `computeObjectExpectedSize()` -- an O(n) walk that does NOT use objectLogicalSize or any tracking field, only pre-existing APIs (sdslen, sdsHdrSize, lpBytes, intsetBlobLen, hashtableMemUsage, raxComputeLogicalSize, etc.). Compares the walk result against slot_stats and returns OK or an error with mismatch details.

Tcl integration tests
27 tests added to `tests/unit/cluster/slot-stats.tcl`: