Closed
Labels
enhancement: New feature or request
Description
Overview
Improve HASplitBrainIT with thread-safe state management, double-checked locking, and cluster stabilization waits to eliminate race conditions in split brain testing.
Part of Epic #2968
Current Issues
- Shared state accessed without synchronization (race conditions)
- Split can trigger multiple times
- No wait for cluster stabilization after rejoin
- Fixed delays don't verify actual cluster state
Improvements to Implement
1. Thread-Safe State Management
// Add volatile fields
private volatile String firstLeader = null;
private volatile boolean split = false;
private volatile boolean rejoining = false;
2. Synchronized Leader Tracking
synchronized (HASplitBrainIT.this) {
  if (firstLeader == null) {
    firstLeader = leader;
    LogManager.instance().log(this, Level.INFO, "First leader detected: %s", leader);
  }
}
3. Double-Checked Locking for Split Trigger (Idempotent)
if (messagesSent >= 20 && !split) {
  synchronized (HASplitBrainIT.this) {
    if (split) {
      return; // Another thread already triggered the split
    }
    split = true;
    LogManager.instance().log(this, Level.INFO, "Triggering network split after %d messages", messagesSent);
    // ... split logic
  }
}
4. Increase Split Duration (Better Quorum Establishment)
// Before: 10 seconds
timer.schedule(new TimerTask() { ... }, 10_000L);
// After: 15 seconds (allows quorum in both partitions)
timer.schedule(new TimerTask() { ... }, 15_000L);
5. Increase Message Threshold for Stability
// Before: 10 messages
if (messagesSent >= 10 && !split)
// After: 20 messages (ensures cluster is stable)
if (messagesSent >= 20 && !split)
6. Cluster Stabilization Wait After Rejoin
Awaitility.await("cluster stabilization")
    .atMost(60, TimeUnit.SECONDS)
    .pollInterval(500, TimeUnit.MILLISECONDS)
    .until(() -> {
      // Verify all servers report the same, non-null leader
      String commonLeader = null;
      for (int i = 0; i < getTotalServers(); i++) {
        String leader = getServer(i).getHA().getLeaderName();
        if (leader == null) {
          return false; // This server has not settled on a leader yet
        }
        if (commonLeader == null) {
          commonLeader = leader;
        } else if (!commonLeader.equals(leader)) {
          return false; // Leaders don't match
        }
      }
      return commonLeader != null;
    });
Validation
# Run test 20 times to verify no race conditions
for i in {1..20}; do
  echo "Run $i/20"
  mvn test -pl server -Dtest=HASplitBrainIT || echo "FAILED: Run $i"
done
Success Criteria
- No race-related failures across the 20 validation runs
- Split triggers exactly once per test
- Proper leader election after rejoin
- All synchronization points implemented
- Thread-safe state management
- Test passes at least 19/20 times (95% success rate)
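The "split triggers exactly once" criterion can be checked in isolation. The following is a minimal stand-alone sketch, not code from HASplitBrainIT: the class `SplitTriggerSketch`, the method `countSplits`, and the `splitCount` counter are hypothetical names, and the counter merely stands in for the real split logic. It releases several threads into the double-checked trigger at the same moment and reports how many times the guarded section ran.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-alone demo; not part of HASplitBrainIT.
public class SplitTriggerSketch {
  private volatile boolean split = false;
  private final AtomicInteger splitCount = new AtomicInteger();

  // Same shape as the double-checked trigger proposed in this issue.
  void onMessage(final int messagesSent) {
    if (messagesSent >= 20 && !split) {
      synchronized (this) {
        if (split) {
          return; // Another thread already triggered the split
        }
        split = true;
        splitCount.incrementAndGet(); // Assumption: stands in for the real split logic
      }
    }
  }

  // Releases all threads into onMessage() at once and returns how often the split ran.
  static int countSplits(final int threads) {
    final SplitTriggerSketch sketch = new SplitTriggerSketch();
    final CountDownLatch start = new CountDownLatch(1);
    final ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await(); // Hold every worker until the latch opens
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        sketch.onMessage(20);
      });
    }
    start.countDown();
    pool.shutdown();
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("worker threads did not finish");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    }
    return sketch.splitCount.get();
  }

  public static void main(final String[] args) {
    System.out.println("splits triggered: " + countSplits(16));
  }
}
```

The unsynchronized `!split` read only skips the lock once the split has happened; the check inside the `synchronized` block is what guarantees a single trigger.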
Expected Impact
Before:
- Race conditions in leader tracking
- Split can trigger multiple times
- No verification of cluster state after rejoin
- Silent failures
After:
- Thread-safe state management
- Idempotent split trigger
- Verified cluster convergence
- Clear success criteria
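"Verified cluster convergence" comes down to one predicate: every server reports the same, non-null leader. Below is a minimal sketch of that predicate as a pure function, so it can be reasoned about without a running cluster. `LeaderConvergenceSketch` and `allAgree` are hypothetical names, and the list of leader names stands in for whatever each server actually reports. Treating a missing (null) leader as "not converged" is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone demo; not part of HASplitBrainIT.
public class LeaderConvergenceSketch {

  // Returns true only if the list is non-empty and every entry is the same non-null name.
  static boolean allAgree(final List<String> leaders) {
    if (leaders.isEmpty()) {
      return false;
    }
    final String first = leaders.get(0);
    if (first == null) {
      return false; // Assumption: a server without a leader means "not converged"
    }
    for (final String leader : leaders) {
      if (!first.equals(leader)) {
        return false; // Covers both a different leader and a null entry
      }
    }
    return true;
  }

  public static void main(final String[] args) {
    // Arrays.asList is used because it accepts null elements.
    System.out.println(allAgree(Arrays.asList("server-1", "server-1", "server-1")));
    System.out.println(allAgree(Arrays.asList("server-1", "server-2", "server-1")));
    System.out.println(allAgree(Arrays.asList(null, "server-1", "server-1")));
  }
}
```

A predicate like this could be polled until it holds, rather than sleeping for a fixed delay.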
Time Estimate
45 minutes
Risk Level
MEDIUM - Adds synchronization, requires careful review
Documentation
See PORTING_PLAN_IT_TEST_IMPROVEMENTS.md - Phase 4, Section 4.2 for detailed instructions.
See HA_TEST_RELIABILITY_ANALYSIS.md - Section on HASplitBrainIT for analysis.
Related Issues
Part of Epic: #2968
Can be done in parallel with: HARandomCrashIT, ReplicationChangeSchemaIT improvements