Skip to content

RedisCluster: Per-node health tracking and non-blocking initialize()#4083

Open
ngabhanenetskope wants to merge 1 commit into
redis:masterfrom
ngabhanenetskope:fix/pool-acquire-timeout
Open

RedisCluster: Per-node health tracking and non-blocking initialize()#4083
ngabhanenetskope wants to merge 1 commit into
redis:masterfrom
ngabhanenetskope:fix/pool-acquire-timeout

Conversation

@ngabhanenetskope
Copy link
Copy Markdown
Contributor

@ngabhanenetskope ngabhanenetskope commented May 27, 2026

Summary

Adds two mechanisms to improve RedisCluster resilience during transient node failures under high concurrency:

  1. Per-node health tracking (NodeHealthManager) — After N consecutive TimeoutError/ConnectionError on a node (default: 5), subsequent requests to that node fail immediately instead of blocking for socket_timeout. After recovery_time (default: 10s), one probe request is allowed through (half-open pattern). On success, the node is marked available again.

  2. Non-blocking initialize() — Changed from blocking lock acquisition to try-lock. If another thread is already refreshing topology, additional threads return immediately using stale (but functional for healthy nodes) topology instead of queuing on _initialization_lock.

Motivation

Under high concurrency (e.g., gevent with 250 greenlets, max_connections_per_node=50, socket_timeout=5s), a single node failure causes:

  1. 50 greenlets block on the dead node for 5s each (pool saturates instantly)
  2. All 50 timeout simultaneously and call initialize() (thundering herd)
  3. All queue on _initialization_lock, serializing 50 redundant CLUSTER SLOTS calls
  4. Health check can't respond → readiness probe fails → pod dies

With this fix:

  • Health tracking: greenlet RFE: Sharding #6 fails instantly (0ms) instead of blocking 5s → pool never saturates beyond 5 connections
  • Non-blocking initialize: only the first thread refreshes topology; others continue serving requests to healthy nodes

See #4074 for full analysis.

Configuration

rc = RedisCluster(
    startup_nodes=[...],
    node_health_failure_threshold=5,   # failures before marking unavailable
    node_health_recovery_time=10.0,    # seconds before allowing probe
)

Behavior

Scenario Before After
6th request to dead node Blocks for socket_timeout (5s) Instant ConnectionError (0ms)
50 threads calling initialize() All queue, 50 serial CLUSTER SLOTS calls 1 refreshes, 49 return immediately
Node recovers after failover Must wait for blocked threads to timeout Probe succeeds after recovery_time → instant recovery

Tests

  • TestNodeHealthManager — 7 tests covering threshold, recovery, per-node isolation, and cluster integration

Related

…luster

Two improvements to RedisCluster's resilience during node failures:

1. Per-node health tracking (NodeHealthManager):
   After N consecutive TimeoutError/ConnectionError on a node (default: 5),
   the node is marked unavailable. Subsequent requests to that node fail
   immediately with ConnectionError instead of blocking for socket_timeout
   seconds. After recovery_time (default: 10s), one probe request is
   allowed through (half-open pattern). On success, the node is marked
   available again.

   This prevents the initial pool saturation cascade: instead of 50
   greenlets all blocking on a dead node for 5s each, the 6th+ greenlet
   fails instantly once the threshold is crossed.

2. Non-blocking initialize():
   Changed from blocking lock acquisition to try-lock pattern. If another
   thread is already refreshing the cluster topology, additional threads
   return immediately instead of queuing on the lock. They use the stale
   (but functional for healthy nodes) topology until the refresh completes.

   This eliminates the thundering herd on initialize(): previously 50
   threads timing out simultaneously would all queue on _initialization_lock,
   serializing 50 redundant CLUSTER SLOTS calls. Now only the first thread
   refreshes; others continue serving requests to healthy nodes.

Configuration (RedisCluster constructor):
  - node_health_failure_threshold: int = 5
  - node_health_recovery_time: float = 10.0

Fixes redis#4074
@jit-ci
Copy link
Copy Markdown

jit-ci Bot commented May 27, 2026

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.

Questions? Comments? Want to learn more? Get in touch with us.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit ec068bd. Configure here.

Comment thread redis/cluster.py
raise ConnectionError(
f"Node {target_node.name} is marked unavailable "
f"(consecutive failures exceeded threshold)"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Health check error triggers unnecessary pool disconnect and reinitialize

Medium Severity

The ConnectionError raised by the health check is caught by the except (ConnectionError, TimeoutError) handler below, which then executes pool disconnection under pool._lock, calls nodes_manager.initialize(), and calls move_node_to_end_of_cached_nodes — all unnecessary for a synthetic fast-fail error. Additionally, ConnectionError is in ERRORS_ALLOW_RETRY, so the outer _internal_execute_command loop retries the same unavailable node multiple times, each time triggering pool manipulation and topology refresh. Under 250 greenlets with 3 retries, this causes ~1000 pool._lock acquisitions and initialize() calls per unavailable node, significantly undermining the "fail fast in 0ms" design goal.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ec068bd. Configure here.

Comment thread redis/cluster.py
# Allow a probe request through (half-open)
del self._node_unavailable_since[node_name]
self._node_failures[node_name] = self.failure_threshold - 1
return True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Half-open state allows all requests, not single probe

Medium Severity

The half-open transition in is_available deletes the node from _node_unavailable_since and returns True. After this first call, every subsequent concurrent call also returns True (the node is no longer in the dict), allowing all requests through simultaneously — not just one probe as documented and intended. Under 250 greenlets, this means the potentially-still-dead node gets flooded after recovery_time, causing the same pool saturation the circuit breaker aims to prevent.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ec068bd. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RedisCluster: Transient node failure causes thundering herd on initialize(), leading to pool exhaustion under high concurrency

1 participant