
fix-split-brain.sh does not recover cluster when Sentinel quorum is broken #398

@kishoregv


Problem

The fix-split-brain.sh sidecar script (defined in _configs.tpl, lines ~454-539) does not handle the case where Sentinel quorum is broken and no master can be determined.

When identify_redis_master() returns an empty string — because Sentinels disagree on who the master is or cannot reach each other — the script's if/elif branches both evaluate false and execution falls through silently. No log message, no alert, no recovery attempt.

This is the most dangerous failure mode: the cluster is genuinely in a split-brain or no-master state, and the script designed to fix it does nothing.

Root Cause

The current logic in fix-split-brain.sh:

MASTER="$(identify_redis_master)"
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    # Branch A: this pod should be master — verify local role matches
elif [ -n "$MASTER" ]; then
    # Branch B: another pod is master — ensure this node replicates to it
fi
# No else branch — when MASTER is empty, nothing happens

When MASTER is empty (quorum broken), neither condition matches. The script loops back and keeps getting empty results indefinitely with no visibility into the failure.
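To make the empty-string paths concrete, here is a minimal sketch of what a master-identification helper like identify_redis_master might look like. This is an illustrative reconstruction, not the chart's actual implementation: the variable names SENTINEL_HOSTS and SENTINEL_PORT and the master group name mymaster are assumptions carried over from the snippets in this issue.

```shell
# Sketch of a master-identification helper (assumed names: SENTINEL_HOSTS,
# SENTINEL_PORT, master group "mymaster"). Returns an empty string when no
# Sentinel is reachable or when reachable Sentinels disagree.
identify_redis_master() {
    first=""
    for SENTINEL in $SENTINEL_HOSTS; do
        # First line of the reply is the master IP; empty if unreachable
        current="$(redis-cli -h "$SENTINEL" -p "${SENTINEL_PORT}" \
            sentinel get-master-addr-by-name mymaster 2>/dev/null | head -n1)"
        [ -z "$current" ] && continue
        if [ -z "$first" ]; then
            first="$current"
        elif [ "$first" != "$current" ]; then
            # Split-brain: Sentinels name different masters
            echo ""
            return 0
        fi
    done
    echo "$first"   # still empty when every Sentinel was unreachable
}
```

Both failure modes described above (mutual disagreement and total unreachability) collapse to the same empty return value, which is why the missing else branch hides them both.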

How to Reproduce

  1. Deploy redis-ha with 3 Redis + 3 Sentinel pods and splitBrainDetection.enabled: true
  2. Simulate quorum loss by killing or network-partitioning 2 of 3 Sentinel pods
  3. Observe that fix-split-brain.sh logs show no output — it silently skips the broken state
  4. The cluster remains stuck with no master election and no recovery attempt
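Step 2 can be scripted. The helper below is illustrative only — the pod names (redis-ha-server-*) and sidecar container name (split-brain-fix) are assumptions based on a typical redis-ha release and should be adjusted to the actual deployment; it is defined here but not invoked:

```shell
# Illustrative only: pod/container names assume a typical "redis-ha"
# release; adjust to your deployment before running.
break_sentinel_quorum() {
    # Delete two of the three Sentinel-bearing pods without waiting,
    # leaving a single Sentinel that cannot form a quorum of 2
    kubectl delete pod redis-ha-server-1 redis-ha-server-2 --wait=false

    # Tail the split-brain sidecar on the survivor; with the current
    # script this shows no output while the cluster has no master
    kubectl logs -f redis-ha-server-0 -c split-brain-fix
}
```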

Related Issues

These issues all share a common pattern: the script fails silently in edge cases rather than logging or recovering.

Suggested Fix

Add an else branch with a consecutive-failure counter and sentinel reset recovery after a configurable threshold:

QUORUM_FAIL_COUNT=0
MAX_QUORUM_FAILURES=${MAX_QUORUM_FAILURES:-5}

# Inside the loop:
MASTER="$(identify_redis_master)"
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    QUORUM_FAIL_COUNT=0
    # existing branch A logic
elif [ -n "$MASTER" ]; then
    QUORUM_FAIL_COUNT=0
    # existing branch B logic
else
    QUORUM_FAIL_COUNT=$((QUORUM_FAIL_COUNT + 1))
    echo "WARNING: Sentinel returned no master (quorum may be broken). Failure count: $QUORUM_FAIL_COUNT/$MAX_QUORUM_FAILURES"
    if [ "$QUORUM_FAIL_COUNT" -ge "$MAX_QUORUM_FAILURES" ]; then
        echo "ERROR: Quorum broken for $MAX_QUORUM_FAILURES consecutive checks. Attempting sentinel reset..."
        for SENTINEL in $SENTINEL_HOSTS; do
            redis-cli -h "$SENTINEL" -p "${SENTINEL_PORT}" sentinel reset mymaster || true
        done
        QUORUM_FAIL_COUNT=0
    fi
fi

This change:

  • Makes the failure visible in logs immediately
  • Uses a configurable threshold (MAX_QUORUM_FAILURES, default 5) to avoid reacting to transient blips
  • Attempts automated recovery via sentinel reset, which makes the Sentinels discard their current state for the master group and re-discover the topology
  • Resets the counter after recovery attempt to allow another cycle
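The counter logic can be exercised outside the chart with a small standalone harness: identify_redis_master is stubbed to simulate a permanently broken quorum, and the sentinel reset call is replaced with a log line (all names mirror the snippet above; ANNOUNCE_IP is a placeholder value):

```shell
# Standalone harness: stub the environment so the else-branch logic
# can be observed without a live cluster.
identify_redis_master() { echo ""; }    # stub: quorum permanently broken
ANNOUNCE_IP="10.0.0.1"
QUORUM_FAIL_COUNT=0
MAX_QUORUM_FAILURES=${MAX_QUORUM_FAILURES:-5}

for _ in 1 2 3 4 5 6; do                # six check cycles
    MASTER="$(identify_redis_master)"
    if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
        QUORUM_FAIL_COUNT=0             # branch A would run here
    elif [ -n "$MASTER" ]; then
        QUORUM_FAIL_COUNT=0             # branch B would run here
    else
        QUORUM_FAIL_COUNT=$((QUORUM_FAIL_COUNT + 1))
        echo "WARNING: no master ($QUORUM_FAIL_COUNT/$MAX_QUORUM_FAILURES)"
        if [ "$QUORUM_FAIL_COUNT" -ge "$MAX_QUORUM_FAILURES" ]; then
            echo "RESET: would run 'sentinel reset mymaster' here"
            QUORUM_FAIL_COUNT=0
        fi
    fi
done
```

Running this prints six WARNING lines and exactly one RESET line (fired on the fifth consecutive failure), after which the counter starts over — confirming the threshold both surfaces the failure and rate-limits the recovery attempt.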

Environment

  • Chart: redis-ha
  • Relevant file: templates/_configs.tpl (fix-split-brain.sh section)
  • Affects all versions with splitBrainDetection enabled
