feat: add statistics and adaptive growth to TagValueSeriesIDCache by davidby-influx · Pull Request #27480 · influxdata/influxdb

davidby-influx · 2026-05-28T00:30:37Z

Add statistics for hits and misses and cache
capacity to TagValueSeriesIDCache to permit
adaptive cache growth up to a configured
max capacity when hit rate is below a
configured target.

Includes new configuration parameters and tests

Closes #27479

Add statistics for hits and misses and cache capacity to TagValueSeriesIDCache to permit adaptive cache growth up to a configured max capacity when hit rate is below a configured target. Includes new configuration parameters and tests Closes #27479

Copilot

Pull request overview

This PR adds observability and adaptive growth for the TSI1 TagValueSeriesIDCache, allowing cache capacity to grow toward a configured maximum when observed hit rate is below target.

Changes:

Adds cache hit/miss/eviction/size/capacity statistics and wires them through engine statistics.
Adds adaptive cache capacity growth policy and related configuration validation.
Adds unit and integration tests for statistics, adaptive resizing, and engine stats plumbing.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`tsdb/index/tsi1/index.go`	Adds cache sizing options, adaptive cache construction, and index statistics forwarding.
`tsdb/index/tsi1/cache.go`	Implements cache statistics, adaptive growth policy, resize logging, and capacity tracking.
`tsdb/index/tsi1/cache_test.go`	Adds tests for statistics and adaptive cache behavior.
`tsdb/index/inmem/inmem.go`	Adds no-op statistics method for interface compatibility.
`tsdb/index.go`	Extends the index interface with statistics reporting.
`tsdb/engine/tsm1/engine.go`	Includes index statistics in engine statistics output.
`tsdb/engine/tsm1/engine_test.go`	Tests TSI1 cache statistics are emitted through engine statistics.
`tsdb/config.go`	Adds adaptive cache config fields and validation.
`tsdb/config_test.go`	Adds validation coverage for adaptive cache configuration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Shut up the AI about an overflow if the cache capacity is the number of stars in the galaxy Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

davidby-influx · 2026-05-28T19:04:32Z

Where it helps

Working set slightly larger than the fixed cache (the target win).
With the old fixed default (100 entries), a workload whose hot set of {measurement, tagKey, tagValue} predicates is, say, 300 thrashes: each query for a hot tuple that just got
evicted re-runs the bitmap merge across partitions and index files. Adaptive sizing detects "low hit rate while evicting" and grows to fit, converting repeated O(merge) work into
O(1) hits. This cuts query latency and the CPU burned on merges, and it smooths the CPU spikiness that thrashing causes — a real stability gain, not just throughput.

Observability. The new tsi1_cache stats (hit/miss/eviction/size/capacity) make a previously opaque component diagnosable. An operator can now see thrashing directly instead of
inferring it from query latency. This is the most broadly useful change — it helps even with adaptive sizing turned off.

Self-tuning without restarts. A deployment with high predicate diversity no longer needs an operator to guess series-id-set-cache-size and restart. It climbs to its real need on
its own, bounded by max.

Bounded, predictable ceiling. Doubling up to a hard max gives a deterministic memory upper bound — you get a bigger cache without unbounded blowup.

Designed safeguard against write-only spurious growth. The gets < minSamples floor means pure-write or read-light windows (where hit rate is 0 simply because nobody's reading)
don't trigger growth. Without that floor, a write-heavy shard would balloon the cache for no reason. This is a deliberately good behavior.

Where it harms

Low read-locality / scan workloads (the dominant failure mode).
The policy cannot distinguish "cache too small" from "inherently no reuse." A query pattern that sweeps many distinct tuples once each — a broad dashboard refresh, an exploratory
SHOW/wildcard scan, a backfill — produces a low hit rate and steady eviction, which is exactly the grow signal. So the cache doubles repeatedly toward max, spending heap on entries
that will never be hit again. Growing does nothing to fix misses caused by uniqueness rather than capacity. Because each entry is a SeriesIDSet bitmap (potentially large), this can
be a substantial, wasted heap increase.

"Never shrinks" makes transient bursts permanent.
A one-time event (nightly batch, a compaction-driven scan, an analyst's ad-hoc query) that briefly drives low locality can ratchet the cache to max — and it stays there for the life
of the shard even after steady-state returns to a tiny working set. Good for avoiding thrash oscillation; bad for reclaiming memory after a spike. There is no decay path.

No global memory budget; per-shard multiplication.
max-size bounds one cache. Every shard's TSI index has its own independently-adaptive cache. On a server with many shards, a low-locality workload that touches many shards can drive
many caches toward max simultaneously, and aggregate memory = Σ(per-shard max). Nothing coordinates a server-wide ceiling, so the real worst-case memory is max × shard_count, which
is easy to under-estimate when setting max.

Misconfiguration turns "adaptive" into "always max."
target-hit-rate validation permits up to 1.0. Set it to 0.99 or 1.0 and essentially every real workload sits below target, so the cache climbs to max regardless of whether growth
helps — a silent memory regression dressed up as tuning. The knob's danger isn't obvious to an operator.

Trading query CPU for GC pressure.
A larger cache means more live heap → longer GC mark phases and/or more frequent GC at a given allocation rate. On latency-sensitive nodes, growing the cache to save merge CPU can
raise tail latency via GC. Net effect is workload-dependent and not something the policy accounts for.

Unconditional hot-path cost, even with adaptive off.
Get/Put now do atomic.AddInt64 on every call (hit/miss/size). They're under the already-held write lock so there's no new contention, but it's extra unconditional work for all TSI
users, including those who never enable adaptive sizing. I'd expect this to be negligible next to the bitmap work, but I can't confirm the magnitude without a benchmark — and it's a
non-zero tax paid by everyone for a feature most won't turn on.

First-decision noise + slow reaction when large.
The first resize uses cumulative-since-start counters (lastHits/lastMisses = 0), so cold-start misses bias the very first decision toward growing early. At the other end, once
capacity is large, each window needs capacity evictions to sample, so it reacts slowly to genuine working-set changes. Minor, but it means early growth can be a touch trigger-happy
and late adaptation sluggish.

Bottom line

It's a clear win when misses are caused by a working set that modestly exceeds a too-small fixed cache, and the statistics are a strict improvement for everyone. The risk
concentrates in memory: the policy grows in response to a low hit rate even when low locality (not capacity) is the cause, never gives the memory back, and has only a per-cache — not
per-server — bound. The safest default is to ship it disabled (which the config does), document max-size as a per-shard figure to be multiplied by shard count, and warn against high
target-hit-rate. If you wanted to harden it later, the highest-value additions would be a decay/shrink path and gating growth on evidence that recently-evicted keys are actually
being re-requested (reuse), rather than on hit rate alone.

The TSI tag-value series-ID cache now exposes hit, miss, eviction, size, and capacity statistics and can size itself within operator-set bounds instead of using a single fixed capacity. Adaptive sizing is enabled by setting both series-id-set-cache-max-size and series-id-set-cache-target-hit-rate; it stays disabled (fixed-capacity, as before) when either is zero. The cache grows by doubling toward the max ceiling when its windowed hit rate falls below target while under eviction pressure, and decays back down when the working set shrinks: it trims capacity and sheds only the least-recently-used entries left untouched over a self-tuning observation window. A cooldown between resizes prevents grow/shrink oscillation, and behavior is unchanged for deployments that do not opt in.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

+// the observed Get hit rate falls below target. Resizes are driven from
+// the eviction path: after every `capacity` evictions, the cache samples
+// its windowed hit rate and grows once if the policy fires. The cache
+// never shrinks. minSamples is a floor on the number of Get operations


+  # starts at series-id-set-cache-size and grows (doubling) up to
+  # series-id-set-cache-max-size whenever its observed query hit rate falls
+  # below series-id-set-cache-target-hit-rate while it is evicting. The cache
+  # never shrinks. Adaptive sizing trades higher heap usage for fewer cache


+	if c.SeriesIDSetCacheTargetHitRate < 0 || c.SeriesIDSetCacheTargetHitRate >= 1 {
+		return ErrSeriesIDSetCacheTargetHitRateRange


+	if target <= 0 || target >= 1 {
+		panic(fmt.Sprintf("NewAdaptiveTagValueSeriesIDCache: target must be in (0, 1), got %v", target))


The TSI tag-value series-ID cache now exposes hit, miss, eviction, size, and capacity statistics and can size itself within operator-set bounds instead of using a single fixed capacity. Adaptive sizing is enabled by setting both series-id-set-cache-max-size and series-id-set-cache-target-hit-rate; it stays disabled (fixed-capacity, as before) when either is zero. The cache grows by doubling toward the max when its windowed hit rate falls below target while under eviction pressure, and decays back down when the working set shrinks: after a coverage-sized window of Gets it trims capacity toward the observed LRU footprint, shedding only the least-recently-used entries left untouched over the window. A Gets-based cooldown between resizes and a conservatism-tunable eviction gate (a confidence bound on the at-target eviction distribution) prevent grow/shrink oscillation. Behavior is unchanged by default. Closes #27479

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Split the eviction counter into Evictions (forced under Put pressure) and ShrinkEvictions (voluntary trim) so only forced evictions feed the binomial eviction-gate. Fix two adjacent bugs: evictLRULocked recedes deepestTouched to e.Prev() instead of nil-ing it when the LRU is the boundary, preventing a spurious shrink that sheds warm entries after a Put on a fully-warm cache; checkShrink rewinds the hit/miss baseline by one when opening a new window so the window-opening Get is counted in hitsW/missesW. Cap a single shrink event at maxShrinkEvictPerEvent (1024) to bound write-lock hold time on huge caches; decay continues over later windows. Add regression tests for both fixes, a policy-table row pinning the per-event cap, and an integration test (ShrinkRepeatsWhenCapped) that exercises two back-to-back capped shrinks and verifies decay continues when a single event would have shed more. Adapt existing tests for the new ShrinkEvictions stat field and get()'s new (set, hit) return.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

davidby-influx self-assigned this May 28, 2026

davidby-influx added area/performance area/stats area/tsi 1.x kind/feature team/edge labels May 28, 2026

davidby-influx requested a review from Copilot May 28, 2026 00:30

Copilot started reviewing on behalf of davidby-influx May 28, 2026 00:30 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

Comment thread tsdb/index/tsi1/index.go

Comment thread tsdb/config.go Outdated

fix: update config.sample.toml, propagate logger

5af272e

davidby-influx requested a review from Copilot May 28, 2026 03:32

Copilot started reviewing on behalf of davidby-influx May 28, 2026 03:32 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

fix: sentinel errors, string constants, dead code removal

9a13fee

davidby-influx requested a review from Copilot May 28, 2026 04:05

Copilot started reviewing on behalf of davidby-influx May 28, 2026 04:05 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

Comment thread tsdb/index/tsi1/cache.go

chore: update struct alignment for atomics

59e5d1e

davidby-influx requested a review from Copilot May 28, 2026 04:25

Copilot started reviewing on behalf of davidby-influx May 28, 2026 04:25 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

Comment thread tsdb/index/tsi1/cache.go Outdated

fix: avoid unlikely overflow

026f4a0

Shut up the AI about an overflow if the cache capacity is the number of stars in the galaxy Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

davidby-influx requested a review from Copilot May 29, 2026 04:46

Copilot started reviewing on behalf of davidby-influx May 29, 2026 04:47 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

davidby-influx linked an issue May 29, 2026 that may be closed by this pull request

Add stats for TagValueSeriesIDCache and adaptive cache growth #27479

Open

davidby-influx requested a review from Copilot May 29, 2026 22:06

Copilot started reviewing on behalf of davidby-influx May 29, 2026 22:06 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

Comment thread tsdb/index/tsi1/cache.go

fix: tighter arg checking, plus test

0f6122f

davidby-influx requested a review from Copilot May 29, 2026 22:18

Copilot started reviewing on behalf of davidby-influx May 29, 2026 22:19 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

Comment thread tsdb/index/tsi1/cache.go

davidby-influx requested a review from Copilot May 30, 2026 00:23

Copilot started reviewing on behalf of davidby-influx May 30, 2026 00:23 View session

Copilot AI reviewed May 30, 2026

View reviewed changes

Comment thread tsdb/index/tsi1/cache.go

fix: reset window state on shrink

67125cf

davidby-influx requested a review from Copilot May 30, 2026 01:24

Copilot started reviewing on behalf of davidby-influx May 30, 2026 01:25 View session

Copilot AI reviewed May 30, 2026

View reviewed changes

fix: extend test to check shrink_eviction stat

ba3091b

		if c.SeriesIDSetCacheTargetHitRate < 0 \|\| c.SeriesIDSetCacheTargetHitRate >= 1 {
		return ErrSeriesIDSetCacheTargetHitRateRange

		if target <= 0 \|\| target >= 1 {
		panic(fmt.Sprintf("NewAdaptiveTagValueSeriesIDCache: target must be in (0, 1), got %v", target))

Conversation

davidby-influx commented May 28, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

davidby-influx commented May 28, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants