fix-split-brain.sh does not recover cluster when Sentinel quorum is broken #398
Description
Problem
The fix-split-brain.sh sidecar script (defined in _configs.tpl, lines ~454-539) does not handle the case where Sentinel quorum is broken and no master can be determined.
When identify_redis_master() returns an empty string — because Sentinels disagree on who the master is or cannot reach each other — the script's if/elif branches both evaluate false and execution falls through silently. No log message, no alert, no recovery attempt.
This is the most dangerous failure mode: the cluster is genuinely in a split-brain or no-master state, and the script designed to fix it does nothing.
Root Cause
The current logic in fix-split-brain.sh:

```sh
MASTER="$(identify_redis_master)"
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
  # Branch A: this pod should be master — verify local role matches
elif [ -n "$MASTER" ]; then
  # Branch B: another pod is master — ensure this node replicates to it
fi
# No else branch — when MASTER is empty, nothing happens
```

When MASTER is empty (quorum broken), neither condition matches. The script loops back and keeps getting empty results indefinitely, with no visibility into the failure.
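The fall-through can be demonstrated in isolation. Below is a minimal, runnable sketch where identify_redis_master is a hypothetical stub that returns an empty string, as it would when Sentinels disagree; the IP is an arbitrary placeholder:

```sh
# Stub: simulates identify_redis_master returning "" when quorum is broken.
identify_redis_master() { echo ""; }

ANNOUNCE_IP="10.0.0.1"   # placeholder pod IP
BRANCH_TAKEN="none"

MASTER="$(identify_redis_master)"
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
  BRANCH_TAKEN="A"       # this pod should be master
elif [ -n "$MASTER" ]; then
  BRANCH_TAKEN="B"       # replicate to the reported master
fi
# With MASTER empty, neither branch ran and nothing was logged.
echo "branch taken: $BRANCH_TAKEN"
```

Running this prints `branch taken: none`, which is exactly the silent no-op the real sidecar hits on every loop iteration while quorum is broken.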
How to Reproduce
- Deploy redis-ha with 3 Redis + 3 Sentinel pods and splitBrainDetection.enabled: true
- Simulate quorum loss by killing or network-partitioning 2 of the 3 Sentinel pods
- Observe that the fix-split-brain.sh logs show no output — it silently skips the broken state
- The cluster remains stuck, with no master election and no recovery attempt
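To confirm the no-quorum state independently of the sidecar, each Sentinel can be asked directly via the SENTINEL CKQUORUM command (which replies with an OK line when quorum is reachable and an error otherwise). The sketch below stubs out the redis-cli call so it runs anywhere; the hosts and the exact error text are illustrative assumptions, and in a real pod the stub would be replaced by the commented redis-cli invocation:

```sh
# Stub standing in for redis-cli; simulates the broken-quorum error reply.
redis_cli_stub() {
  echo "NOQUORUM 1 usable Sentinels. Quorum and failover authorization may not be possible"
}

SENTINEL_HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"   # placeholder Sentinel IPs
BROKEN=0
for SENTINEL in $SENTINEL_HOSTS; do
  # Real command: redis-cli -h "$SENTINEL" -p 26379 sentinel ckquorum mymaster
  REPLY="$(redis_cli_stub "$SENTINEL")"
  case "$REPLY" in
    OK*) : ;;                       # quorum intact according to this Sentinel
    *)   BROKEN=$((BROKEN + 1)) ;;  # error reply: quorum broken or unreachable
  esac
done
echo "sentinels reporting broken quorum: $BROKEN"
```

If every reachable Sentinel reports a broken quorum while the sidecar logs stay empty, the cluster is in exactly the state this issue describes.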
Related Issues
- [chart/redis-ha][BUG] K8s Cluster upgrade causes split brain #121 — original split-brain bug during K8s cluster upgrades (closed)
- [chart/redis-ha] split-brain-fix.sh is executed on "sh" but uses "==" instead of "=" for comparison #229 — POSIX == vs = comparison bug that silently disabled the fix entirely (closed)
- [chart/redis-ha][BUG] split-brain-fix causes unnecessary master shutdown during failover #383 — race condition where identify_redis_master() returns empty on a newly promoted master, causing an unnecessary shutdown (open)
All three share a common pattern: the script fails silently in edge cases rather than logging or recovering.
Suggested Fix
Add an else branch with a consecutive-failure counter and sentinel reset recovery after a configurable threshold:
```sh
QUORUM_FAIL_COUNT=0
MAX_QUORUM_FAILURES=${MAX_QUORUM_FAILURES:-5}

# Inside the loop:
MASTER="$(identify_redis_master)"
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
  QUORUM_FAIL_COUNT=0
  # existing branch A logic
elif [ -n "$MASTER" ]; then
  QUORUM_FAIL_COUNT=0
  # existing branch B logic
else
  QUORUM_FAIL_COUNT=$((QUORUM_FAIL_COUNT + 1))
  echo "WARNING: Sentinel returned no master (quorum may be broken). Failure count: $QUORUM_FAIL_COUNT/$MAX_QUORUM_FAILURES"
  if [ "$QUORUM_FAIL_COUNT" -ge "$MAX_QUORUM_FAILURES" ]; then
    echo "ERROR: Quorum broken for $MAX_QUORUM_FAILURES consecutive checks. Attempting sentinel reset..."
    for SENTINEL in $SENTINEL_HOSTS; do
      redis-cli -h "$SENTINEL" -p "${SENTINEL_PORT}" sentinel reset mymaster || true
    done
    QUORUM_FAIL_COUNT=0
  fi
fi
```

This change:
- Makes the failure visible in logs immediately
- Uses a configurable threshold (MAX_QUORUM_FAILURES, default 5) to avoid reacting to transient blips
- Attempts automated recovery via sentinel reset, which forces Sentinels to re-discover the topology
- Resets the counter after the recovery attempt to allow another cycle
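The counter behavior can be sanity-checked without a live cluster. Below is a self-contained simulation of the proposed else branch: identify_redis_master is stubbed to always return nothing (persistent quorum loss), and the redis-cli reset loop is replaced by a hypothetical sentinel_reset_all stub that just counts invocations:

```sh
# Stubs: quorum never recovers, and resets are counted instead of executed.
identify_redis_master() { echo ""; }
RESETS=0
sentinel_reset_all() { RESETS=$((RESETS + 1)); }

ANNOUNCE_IP="10.0.0.1"   # placeholder pod IP
QUORUM_FAIL_COUNT=0
MAX_QUORUM_FAILURES=5

i=0
while [ "$i" -lt 12 ]; do   # 12 check iterations for the simulation
  MASTER="$(identify_redis_master)"
  if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    QUORUM_FAIL_COUNT=0
  elif [ -n "$MASTER" ]; then
    QUORUM_FAIL_COUNT=0
  else
    QUORUM_FAIL_COUNT=$((QUORUM_FAIL_COUNT + 1))
    if [ "$QUORUM_FAIL_COUNT" -ge "$MAX_QUORUM_FAILURES" ]; then
      sentinel_reset_all
      QUORUM_FAIL_COUNT=0
    fi
  fi
  i=$((i + 1))
done
echo "resets attempted: $RESETS"
```

With the threshold at 5, twelve consecutive empty results trigger a reset on checks 5 and 10 (two resets total), confirming that the counter resets after each recovery attempt and the cycle can repeat.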
Environment
- Chart: redis-ha
- Relevant file: templates/_configs.tpl (fix-split-brain.sh section)
- Affects all versions with splitBrainDetection enabled