This document tracks all fixes applied to resolve race conditions and deadlocks in the ratchet manager during simultaneous rekey scenarios.
Purpose: Maintain a chronological record of problems discovered, solutions applied, and pipeline outcomes to inform future debugging.
Commit: 99e9e4d - Register listener before sending AliceToBob
Date: 2025-10-20
Pipeline: ✅ PASSED
When wait_for_completion=true, the listener was registered AFTER sending AliceToBob. If the rekey completed very quickly in the background loop, it could notify before we registered the listener, causing trigger_rekey(wait=true) to timeout waiting for a notification that already happened.
Moved listener registration from line 446 to line 403 (before send). Register the listener BEFORE sending AliceToBob to ensure we're subscribed to notifications before the rekey can complete.
- File:
citadel_crypt/src/ratchets/ratchet_manager.rs - Lines: Moved listener registration before
AliceToBobsend - Restructured code to handle
wait_for_completion=falseseparately
✅ Pipeline passed - fixed listener notification timing issue
Commit: ca83911 - Fix peer connection callback cancellation
Date: 2025-10-20
Pipeline: ✅ PASSED
Related Commit: 844769b (actual fix)
Using try_join! in PeerConnectionKernel caused premature cancellation of user callbacks. When one branch completed, try_join! would cancel the other branch immediately, preventing proper completion.
Evidence from logs:
- Connection counts incomplete
- Callbacks terminated mid-execution
- Flaky failures in P2P connection tests
Changed from try_join! to join! to ensure both branches complete regardless of individual completion times. This allows user callbacks to finish naturally without premature cancellation.
- File:
citadel_sdk/src/prelude/peer_connection.rs - Changed:
try_join!→join!in connection kernel
peer_to_peer_connect_transient::case_3peer_to_peer_connect_transient::case_4- Other P2P connection tests
✅ Pipeline passed - fixed 6 flaky test instances
Commit: 690a3a6 - Gracefully handle connection closure during shutdown
Date: 2025-10-21
Pipeline: ✅ PASSED
get_preferred_stream() panicked with "The C2S virtual connection should always exist" when C2S was dropped before P2P cleanup completed.
Shutdown sequence:
- Rekey shutdown triggered
- C2S connection dropped
- P2P cleanup tries to access C2S for fallback
- Panic:
expect()onNonevalue
Changed get_preferred_stream() to return Option<StreamKey> instead of panicking. Callers check for None and handle gracefully (log warning, skip operation, or use alternative).
- File:
citadel_sdk/src/prefabs/client/single_connection.rs - Changed:
expect()→ returnNone - Updated all callers to handle
Option
stress_test_p2p_messaging(multiple encryption algorithms)- Test panics on first try, timeouts on retry (120s)
✅ Pipeline passed - eliminated shutdown panics
Commit: 28aab81 - Trigger CI for flaky test analysis
Date: 2025-10-26
Pipeline: ❌ FAILED - 3/3 failures on min_delay_1_0
This was an empty commit to trigger CI and analyze a new critical deadlock discovered in test_ratchet_manager_racy_with_random_start_lag::min_delay_1_0.
Failure Details:
- Test failed 3/3 retry attempts
- 30-second timeout/deadlock on macOS
- Blocked entire CI pipeline
❌ Pipeline failed - exposed critical deadlock (see Fix #5)
Commit: f7dd436 - Resolve ratchet manager deadlock in simultaneous rekey
Date: 2025-10-27
Pipeline: ❌ FAILED - Different deadlock scenario exposed
The deadlock occurred when both clients simultaneously initiated rekey:
- Both send
AliceToBobmessages - Leader processes
BobToAlice, sendsTruncate - Loser processes
Truncate, sendsLeaderCanFinish, completes - Leader receives stale
AliceToBobfrom Loser's initial attempt - Leader tries to process it but has no constructor → deadlock
- 30-second timeout triggers panic
Key Insight: In simultaneous rekey scenarios, the Leader can receive a stale AliceToBob from the Loser's initial rekey attempt that was superseded. The code didn't handle this case—it expected to have a constructor but it had already been cleaned up.
When Leader receives AliceToBob but has no constructor for it, treat it as a stale message from a superseded simultaneous rekey attempt and skip it (continue waiting for LeaderCanFinish).
- File:
citadel_crypt/src/ratchets/ratchet_manager.rs - Lines: 665-686 (new stale message detection logic)
- Added check: If Leader role && no constructor → skip as stale
if self.role() == RekeyRole::Leader {
let has_constructor = self.constructors.lock().contains_key(&peer_metadata.next_version);
if !has_constructor {
stale_message_count += 1;
log::debug!("ignoring stale AliceToBob...");
continue; // Skip this stale message
}
}- ✅
min_delay_1_0: 5/5 passes - ✅
min_delay_2_1: 5/5 passes - ✅ All 7 ratchet_manager tests passed
❌ Pipeline still failed - Fix was incomplete, exposed second deadlock scenario (see Fix #6)
Why it failed: The fix addressed Leader receiving stale messages, but didn't address:
- Loser retaining unused Alice constructors
- Overly aggressive Idle role stale message detection
Commit: 5fb1afc - Resolve second deadlock by cleaning up unused constructors
Date: 2025-10-27
Pipeline: ⏳ PENDING (pushed, awaiting results)
After Fix #5 was applied, a NEW deadlock scenario appeared in CI:
Causal Chain:
- During simultaneous rekey, Loser creates Alice constructor for version N+1
- Leader's rekey wins, Loser completes via
TruncateorLoserCanFinish - Loser's unused Alice constructor is RETAINED (because
version > completed_version) - After both go Idle, Loser initiates new rekey using stale constructor
- Leader (now Idle) receives
AliceToBob, has no constructor - Fix #5's logic was TOO AGGRESSIVE: skipped legitimate new rekeys in Idle role
- Deadlock: Loser waits for response, Leader skips and waits for next message
Root Cause Analysis:
Looking at line 1112-1115 in ratchet_manager.rs:
self.constructors.lock().retain(|&version, _| version > completed_version);This retains constructors for versions GREATER than completed version. So if we complete version 15, constructor for version 16 is kept. This is intentional (to support concurrent rekey attempts), but causes problems when the version 16 constructor was from a SUPERSEDED simultaneous rekey.
CI Logs Evidence:
Client 20: recv AliceToBob: peer_earliest=11, peer_latest=15, local_earliest=11, local_latest=15, role=Idle
ERROR: Rekey round timed out after 30 seconds
Client 20 completed rekey (version 15), went Idle, then received Client 10's AliceToBob. Versions match (both at 15), so it passes barrier check. But Client 20 has no constructor for version 16. Fix #5 skipped it thinking it was stale, but it was actually a NEW legitimate rekey attempt!
Two-part fix:
Added constructor cleanup in TWO places:
Location 1: Truncate message handler (lines 915-962)
// After processing truncate, before sending LeaderCanFinish:
let unused_constructor_version = latest_version + 1;
if let Some(_removed) = self.constructors.lock().remove(&unused_constructor_version) {
log::debug!("removed unused Alice constructor for version {}", unused_constructor_version);
}Location 2: LoserCanFinish message handler (lines 964-1013)
// After processing LoserCanFinish, before sending LeaderCanFinish:
let unused_constructor_version = latest_version + 1;
if let Some(_removed) = self.constructors.lock().remove(&unused_constructor_version) {
log::debug!("removed unused Alice constructor for version {}", unused_constructor_version);
}Why this works: When Loser completes (whether via Truncate or LoserCanFinish), it explicitly removes the Alice constructor that was created during simultaneous rekey but never used. This prevents the stale constructor from being used in future rekey attempts.
Changed from:
if !has_constructor && (self.role() == RekeyRole::Leader || self.role() == RekeyRole::Idle)To:
if self.role() == RekeyRole::Leader {
let has_constructor = ...;
if !has_constructor {
// Skip stale message
}
}
// If Idle: don't skip, process normally as new rekeyWhy this works:
- Leader role: No constructor means stale message from superseded simultaneous rekey → skip it
- Idle role: No constructor is NORMAL! Peer is initiating a NEW rekey, and we act as Bob → process it
- File:
citadel_crypt/src/ratchets/ratchet_manager.rs - Lines 953-961: Added constructor cleanup in Truncate handler
- Lines 1013-1019: Added constructor cleanup in LoserCanFinish handler
- Lines 665-686: Refined stale message detection (Leader only, not Idle)
- ✅
min_delay_1_0: 5/5 passes - ✅
min_delay_2_1: 5/5 passes - ✅ 10/10 total passes (both tests, 5 runs each)
- ✅ All tests complete in <1s (vs 30s timeout)
- Constructor retention policy (
version > completed_version) is correct for concurrent rekeys but needs explicit cleanup for superseded simultaneous rekeys - Role-based stale detection is critical: Leader vs Idle have different semantics
- Missing constructor means different things in different roles:
- Leader: Stale message from cancelled rekey
- Idle: Normal - peer initiating new rekey
- Two cleanup paths needed: Both Truncate and LoserCanFinish need constructor removal
⏳ Pending CI results - expecting full pipeline pass
- Simultaneous operations create complex message orderings: Even with Leader/Loser roles, messages from cancelled operations can arrive out of order
- Constructor lifecycle is critical: Must match constructor creation with explicit cleanup in all completion paths
- Role semantics matter: Same situation (no constructor) means different things in different roles
- Log tracing with checkpoints: CBD-RKT markers made it possible to trace exact execution flow
- Version tracking: Logging earliest/latest versions at each step revealed synchronization issues
- Timeout as symptom: 30s timeout consistently pointed to deadlock, not performance
- Local reproduction: Minimal delay tests (
min_delay_1_0) reliably triggered races
- Explicit cleanup > implicit cleanup: Relying on version-based retention wasn't enough
- Early validation: Check role/state before processing to catch invalid scenarios
- Stale message detection: Critical for async protocols with in-flight messages
- Comprehensive testing: Fixed one deadlock, exposed another - iterative process
- ✅ All ratchet manager tests passing consistently
- ✅ No 30-second timeouts in any test
- ✅ No
[CBD-RKT-STALE]log spam (1-2 is normal, dozens is concerning) - ✅ All platforms stable (macos/ubuntu/windows)
- ❌ New timeouts in different rekey tests
- ❌ Version mismatch errors increasing
- ❌
MAX_STALE_MESSAGESlimit being hit - ❌ Flaky tests reappearing on specific platforms
- ❌ Constructor map growing unbounded
- Explicit cancellation messages: Send "cancel" when Leader wins, acknowledge on Loser side
- Sequence numbers: Add monotonic sequence numbers to rekey rounds for robust stale detection
- State machine documentation: Document all states and valid transitions
- Invariant checks: Add assertions between states to catch bugs earlier
- Constructor lifecycle tracing: Add explicit logging for constructor create/remove/use
Commit: ee10cad - Add stale message detection for BobToAlice
Date: 2025-10-27
Pipeline: ⏳ PENDING (pushed, awaiting results)
Ratchet Stability Test failed 1/10 iterations with a 360-second timeout:
Iteration 5/10 FAILED [ 360.169s]
test_messenger_racy_with_random_start_lag::min_delay_2_1::secrecy_mode_2_SecrecyMode__Perfect
WARN: Client 10 rekey error: Unexpected BobToAlice message with no loaded
local constructor for next_version 87
Analysis: During high-frequency rekey scenarios (87+ rounds!), the same stale message issue we fixed for AliceToBob (Fix #5) also applies to BobToAlice messages.
Scenario:
- Both clients perform many simultaneous rekeys (reaching version 87)
- Leader wins a contentious rekey, removes its Alice constructor
- Leader receives stale
BobToAlicefrom Loser's superseded attempt - Leader tries to process it but has no constructor → error
- Error propagates, test times out after 6 minutes
Key Insight: We fixed stale AliceToBob detection in Fix #5 but forgot the symmetric case—Leader can also receive stale BobToAlice messages from superseded simultaneous rekey attempts.
Apply the same stale detection logic to BobToAlice that we use for AliceToBob:
// Before processing BobToAlice, check if we have constructor
let has_constructor = self.constructors.lock().contains_key(&peer_metadata.next_version);
if !has_constructor && self.role() == RekeyRole::Leader {
stale_message_count += 1;
log::debug!("[CBD-RKT-STALE] Client {} (Leader) ignoring BobToAlice with no constructor", self.cid);
if stale_message_count >= MAX_STALE_MESSAGES {
return Err(...);
}
continue; // Skip this stale message
}Why this works: Same reasoning as Fix #5—if Leader has no constructor for a BobToAlice message, it's from a cancelled/superseded simultaneous rekey. Skip it and wait for proper completion messages.
- File:
citadel_crypt/src/ratchets/ratchet_manager.rs - Lines: 837-856 (new stale detection for BobToAlice)
- Pattern: Mirrors the AliceToBob stale detection logic
- ✅
test_messenger_racy_with_random_start_lag::min_delay_2_1::secrecy_mode_2_SecrecyMode__Perfect: 5/5 passes - ✅ All tests complete in <1s (vs 360s timeout)
- Symmetry matters: If one message type can be stale, its counterpart likely can too
- High-frequency testing reveals edge cases: 87 rekey rounds exposed issue that lower counts missed
- Pattern replication: Once we identified the fix pattern for AliceToBob, applying it to BobToAlice was straightforward
- Stability tests are valuable: Running 10 iterations catches low-probability races (10% failure rate)
⏳ Pending CI results - expecting Ratchet Stability Test to pass 10/10
99e9e4d- Listener registration race (PASSED)ca83911/844769b- Peer connection cancellation (PASSED)690a3a6- C2S shutdown panic (PASSED)28aab81- Empty commit for analysis (FAILED - exposed deadlock)f7dd436- First deadlock fix (FAILED - incomplete)5fb1afc- Second deadlock fix (PENDING)638adc5- Documentation (Fix log creation)ee10cad- BobToAlice stale detection (PENDING)
min_delay_1_0: Initially 0/3 → After Fix #5: 5/5 local → After Fix #6: 5/5 local → After Fix #7: N/Amin_delay_2_1: Initially 0/3 → After Fix #5: 5/5 local → After Fix #6: 5/5 local → After Fix #7: N/Atest_messenger_racy...Perfect: Initially 1/10 → After Fix #7: 5/5 local
citadel_crypt/src/ratchets/ratchet_manager.rs(Fixes #1, #5, #6, #7)citadel_sdk/src/prelude/peer_connection.rs(Fix #2)citadel_sdk/src/prefabs/client/single_connection.rs(Fix #3)
Last Updated: 2025-10-27
Status: Awaiting CI results for Fix #6 and Fix #7
Next Steps: Monitor pipeline, especially Ratchet Stability Test (10 iterations)