
fix-split-brain.sh does not recover cluster when Sentinel quorum is broken #398

@kishoregv


Problem

The fix-split-brain.sh sidecar script (defined in _configs.tpl, lines ~454-539) does not handle the case where Sentinel quorum is broken and no master can be determined.

When identify_redis_master() returns an empty string — because Sentinels disagree on who the master is or cannot reach each other — the script's if/elif branches both evaluate false and execution falls through silently. No log message, no alert, no recovery attempt.

This is the most dangerous failure mode: the cluster is genuinely in a split-brain or no-master state, and the script designed to fix it does nothing.

Root Cause

The current logic in fix-split-brain.sh:

MASTER="$(identify_redis_master)"
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    # Branch A: this pod should be master — verify local role matches
elif [ -n "$MASTER" ]; then
    # Branch B: another pod is master — ensure this node replicates to it
fi
# No else branch — when MASTER is empty, nothing happens

When MASTER is empty (quorum broken), neither condition matches. The script loops back and keeps getting empty results indefinitely with no visibility into the failure.
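To make the empty-string paths concrete, here is a minimal sketch of what a master-identification helper like identify_redis_master might look like. This is an illustrative reconstruction, not the chart's actual implementation: the variable names SENTINEL_HOSTS and SENTINEL_PORT and the master group name mymaster are assumptions carried over from the snippets in this issue.

```shell
# Sketch of a master-identification helper (assumed names: SENTINEL_HOSTS,
# SENTINEL_PORT, master group "mymaster"). Returns an empty string when no
# Sentinel is reachable or when reachable Sentinels disagree.
identify_redis_master() {
    first=""
    for SENTINEL in $SENTINEL_HOSTS; do
        # First line of the reply is the master IP; empty if unreachable
        current="$(redis-cli -h "$SENTINEL" -p "${SENTINEL_PORT}" \
            sentinel get-master-addr-by-name mymaster 2>/dev/null | head -n1)"
        [ -z "$current" ] && continue
        if [ -z "$first" ]; then
            first="$current"
        elif [ "$first" != "$current" ]; then
            # Split-brain: Sentinels name different masters
            echo ""
            return 0
        fi
    done
    echo "$first"   # still empty when every Sentinel was unreachable
}
```

Both failure modes described above (mutual disagreement and total unreachability) collapse to the same empty return value, which is why the missing else branch hides them both.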

How to Reproduce

  1. Deploy redis-ha with 3 Redis + 3 Sentinel pods and splitBrainDetection.enabled: true
  2. Simulate quorum loss by killing or network-partitioning 2 of 3 Sentinel pods
  3. Observe that fix-split-brain.sh logs show no output — it silently skips the broken state
  4. The cluster remains stuck with no master election and no recovery attempt
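Step 2 can be scripted. The helper below is illustrative only — the pod names (redis-ha-server-*) and sidecar container name (split-brain-fix) are assumptions based on a typical redis-ha release and should be adjusted to the actual deployment; it is defined here but not invoked:

```shell
# Illustrative only: pod/container names assume a typical "redis-ha"
# release; adjust to your deployment before running.
break_sentinel_quorum() {
    # Delete two of the three Sentinel-bearing pods without waiting,
    # leaving a single Sentinel that cannot form a quorum of 2
    kubectl delete pod redis-ha-server-1 redis-ha-server-2 --wait=false

    # Tail the split-brain sidecar on the survivor; with the current
    # script this shows no output while the cluster has no master
    kubectl logs -f redis-ha-server-0 -c split-brain-fix
}
```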

Related Issues

These issues all share a common pattern: the script fails silently in edge cases rather than logging or recovering.

Suggested Fix

Add an else branch with a consecutive-failure counter and sentinel reset recovery after a configurable threshold:

QUORUM_FAIL_COUNT=0
MAX_QUORUM_FAILURES=${MAX_QUORUM_FAILURES:-5}

# Inside the loop:
MASTER="$(identify_redis_master)"
if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
    QUORUM_FAIL_COUNT=0
    # existing branch A logic
elif [ -n "$MASTER" ]; then
    QUORUM_FAIL_COUNT=0
    # existing branch B logic
else
    QUORUM_FAIL_COUNT=$((QUORUM_FAIL_COUNT + 1))
    echo "WARNING: Sentinel returned no master (quorum may be broken). Failure count: $QUORUM_FAIL_COUNT/$MAX_QUORUM_FAILURES"
    if [ "$QUORUM_FAIL_COUNT" -ge "$MAX_QUORUM_FAILURES" ]; then
        echo "ERROR: Quorum broken for $MAX_QUORUM_FAILURES consecutive checks. Attempting sentinel reset..."
        for SENTINEL in $SENTINEL_HOSTS; do
            redis-cli -h "$SENTINEL" -p "${SENTINEL_PORT}" sentinel reset mymaster || true
        done
        QUORUM_FAIL_COUNT=0
    fi
fi

This change:

  • Makes the failure visible in logs immediately
  • Uses a configurable threshold (MAX_QUORUM_FAILURES, default 5) to avoid reacting to transient blips
  • Attempts automated recovery via sentinel reset, which makes the Sentinels discard their current state for the master group and re-discover the topology
  • Resets the counter after recovery attempt to allow another cycle
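The counter logic can be exercised outside the chart with a small standalone harness: identify_redis_master is stubbed to simulate a permanently broken quorum, and the sentinel reset call is replaced with a log line (all names mirror the snippet above; ANNOUNCE_IP is a placeholder value):

```shell
# Standalone harness: stub the environment so the else-branch logic
# can be observed without a live cluster.
identify_redis_master() { echo ""; }    # stub: quorum permanently broken
ANNOUNCE_IP="10.0.0.1"
QUORUM_FAIL_COUNT=0
MAX_QUORUM_FAILURES=${MAX_QUORUM_FAILURES:-5}

for _ in 1 2 3 4 5 6; do                # six check cycles
    MASTER="$(identify_redis_master)"
    if [ "$MASTER" = "$ANNOUNCE_IP" ]; then
        QUORUM_FAIL_COUNT=0             # branch A would run here
    elif [ -n "$MASTER" ]; then
        QUORUM_FAIL_COUNT=0             # branch B would run here
    else
        QUORUM_FAIL_COUNT=$((QUORUM_FAIL_COUNT + 1))
        echo "WARNING: no master ($QUORUM_FAIL_COUNT/$MAX_QUORUM_FAILURES)"
        if [ "$QUORUM_FAIL_COUNT" -ge "$MAX_QUORUM_FAILURES" ]; then
            echo "RESET: would run 'sentinel reset mymaster' here"
            QUORUM_FAIL_COUNT=0
        fi
    fi
done
```

Running this prints six WARNING lines and exactly one RESET line (fired on the fifth consecutive failure), after which the counter starts over — confirming the threshold both surfaces the failure and rate-limits the recovery attempt.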

Environment

  • Chart: redis-ha
  • Relevant file: templates/_configs.tpl (fix-split-brain.sh section)
  • Affects all versions with splitBrainDetection enabled
