Add thread safety and cluster stabilization to HASplitBrainIT #2972

@robfrank

Overview

Improve HASplitBrainIT with thread-safe state management, double-checked locking, and cluster stabilization waits to eliminate race conditions in split-brain testing.

Part of Epic #2968

Current Issues

  • Shared state accessed without synchronization (race conditions)
  • Split can be triggered more than once, because the check of the split flag and its update are not atomic
  • No wait for cluster stabilization after rejoin
  • Fixed delays don't verify actual cluster state

Improvements to Implement

1. Thread-Safe State Management

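// Shared test state is read and written from several threads;
// volatile makes each write visible to subsequent reads in other threads
private volatile String firstLeader = null;
private volatile boolean split = false;
private volatile boolean rejoining = false;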
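
As a rough illustration of what volatile does here, the sketch below has one thread spin on a flag until another thread's write becomes visible. The class name SplitState and the helper readerObservesWrite are hypothetical and not part of HASplitBrainIT. Note that volatile only covers visibility: a check followed by an update is still not atomic, which is why sections 2 and 3 also use a lock.

```java
/** Hypothetical holder mirroring the three flags listed above. */
final class SplitState {
  // volatile: a write by one thread is visible to later reads by other threads
  volatile String firstLeader = null;
  volatile boolean split = false;
  volatile boolean rejoining = false;

  /**
   * Starts a reader that spins until it sees split == true, then sets the flag
   * from the calling thread. Returns true if the reader ended within the timeout.
   */
  static boolean readerObservesWrite(final long timeoutMillis) {
    final SplitState state = new SplitState();
    final Thread reader = new Thread(() -> {
      while (!state.split) {
        Thread.onSpinWait(); // busy-wait until the volatile write is observed
      }
    });
    reader.setDaemon(true); // do not keep the JVM alive if something goes wrong
    reader.start();

    state.split = true; // volatile write, guaranteed to become visible to the reader

    try {
      reader.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !reader.isAlive();
  }
}
```

Calling SplitState.readerObservesWrite(5000) should return true, since the reader is guaranteed to see the volatile write.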

2. Synchronized Leader Tracking

synchronized (HASplitBrainIT.this) {
  if (firstLeader == null) {
    firstLeader = leader;
    LogManager.instance().log(this, Level.INFO, "First leader detected: %s", leader);
  }
}
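
The set-once behavior of this block can be tried in isolation. The following is a minimal sketch, assuming a standalone class (FirstLeaderTracker, recordIfFirst, and the server-N names are all hypothetical) rather than the real test callback. It races several threads against one tracker and counts how many managed to set the leader.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the leader-tracking part of the test. */
final class FirstLeaderTracker {
  private volatile String firstLeader = null;

  /** Records the leader only if none is recorded yet; returns true if this call set it. */
  synchronized boolean recordIfFirst(final String leader) {
    if (firstLeader == null && leader != null) {
      firstLeader = leader;
      return true;
    }
    return false;
  }

  String getFirstLeader() {
    return firstLeader;
  }

  /** Races the given number of threads against one tracker; returns how many set the leader. */
  static int raceAndCountWinners(final int threads) {
    final FirstLeaderTracker tracker = new FirstLeaderTracker();
    final CountDownLatch start = new CountDownLatch(1);
    final AtomicInteger winners = new AtomicInteger();
    final Thread[] pool = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      final String name = "server-" + i; // made-up server names
      pool[i] = new Thread(() -> {
        try {
          start.await(); // release all threads at the same moment
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        if (tracker.recordIfFirst(name)) {
          winners.incrementAndGet();
        }
      });
      pool[i].start();
    }
    start.countDown();
    for (final Thread t : pool) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return winners.get();
  }
}
```

Because the null check and the assignment happen under the same lock, raceAndCountWinners should report exactly one winner regardless of the thread count.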

3. Double-Checked Locking for Split Trigger (Idempotent)

if (messagesSent >= 20 && !split) {
  synchronized (HASplitBrainIT.this) {
    if (split) {
      return; // Another thread already triggered the split
    }
    split = true;
    LogManager.instance().log(this, Level.INFO, "Triggering network split after %d messages", messagesSent);
    // ... split logic
  }
}
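
To check the "exactly once" property outside the integration test, the pattern can be extracted into a small class. This is only a sketch under assumptions: SplitTrigger, maybeTrigger, and countExecutions are hypothetical names, and the split action is a plain Runnable standing in for the real split logic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical extraction of the double-checked split trigger. */
final class SplitTrigger {
  private final int threshold;
  private final Runnable splitAction;
  private volatile boolean split = false;

  SplitTrigger(final int threshold, final Runnable splitAction) {
    this.threshold = threshold;
    this.splitAction = splitAction;
  }

  /** Returns true only for the single call that actually performed the split. */
  boolean maybeTrigger(final int messagesSent) {
    if (messagesSent < threshold || split) {
      return false; // first check, without taking the lock
    }
    synchronized (this) {
      if (split) {
        return false; // second check: another thread got here first
      }
      split = true;
      splitAction.run();
      return true;
    }
  }

  /** Lets many threads hit the trigger at once; returns how often the action ran. */
  static int countExecutions(final int threads, final int threshold) {
    final AtomicInteger executions = new AtomicInteger();
    final SplitTrigger trigger = new SplitTrigger(threshold, executions::incrementAndGet);
    final CountDownLatch start = new CountDownLatch(1);
    final Thread[] pool = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      pool[i] = new Thread(() -> {
        try {
          start.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        trigger.maybeTrigger(threshold);
      });
      pool[i].start();
    }
    start.countDown();
    for (final Thread t : pool) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return executions.get();
  }
}
```

The unlocked first check keeps the common path cheap once the split has happened, while the second check under the lock is what makes the trigger idempotent. The flag must be volatile for the unlocked read to be reliable.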

4. Increase Split Duration (Better Quorum Establishment)

// Before: 10 seconds
timer.schedule(new TimerTask() { ... }, 10_000L);

// After: 15 seconds (allows quorum in both partitions)
timer.schedule(new TimerTask() { ... }, 15_000L);
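
One way to keep the duration in a single place is a named constant plus a small scheduling helper. The sketch below assumes a hypothetical SplitHealScheduler class and a heal callback passed as a Runnable. The body of the real TimerTask is elided in this issue and is not reproduced here.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that schedules the end of the network split. */
final class SplitHealScheduler {
  /** Split duration proposed in this issue (previously 10_000L). */
  static final long SPLIT_DURATION_MS = 15_000L;

  /** Runs the heal action once after delayMs on a daemon timer thread. */
  static Timer scheduleHeal(final Runnable heal, final long delayMs) {
    final Timer timer = new Timer("split-heal", true);
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        try {
          heal.run();
        } finally {
          timer.cancel(); // one-shot: release the timer thread afterwards
        }
      }
    }, delayMs);
    return timer;
  }

  /** Returns true if a heal scheduled with delayMs ran within timeoutMs. */
  static boolean healRunsWithin(final long delayMs, final long timeoutMs) {
    final CountDownLatch healed = new CountDownLatch(1);
    scheduleHeal(healed::countDown, delayMs);
    try {
      return healed.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

In the test this would be invoked as scheduleHeal(healAction, SplitHealScheduler.SPLIT_DURATION_MS), where healAction is whatever restores connectivity.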

5. Increase Message Threshold for Stability

// Before: 10 messages
if (messagesSent >= 10 && !split)

// After: 20 messages (ensures cluster is stable)
if (messagesSent >= 20 && !split)

6. Cluster Stabilization Wait After Rejoin

// Requires org.awaitility.Awaitility and java.util.concurrent.TimeUnit
Awaitility.await("cluster stabilization")
    .atMost(60, TimeUnit.SECONDS)
    .pollInterval(500, TimeUnit.MILLISECONDS)
    .until(() -> {
      // Verify that every server reports the same, non-null leader
      String commonLeader = null;
      for (int i = 0; i < getTotalServers(); i++) {
        final String leader = getServer(i).getHA().getLeaderName();
        if (leader == null) {
          return false; // This server does not know a leader yet
        }
        if (commonLeader == null) {
          commonLeader = leader;
        } else if (!commonLeader.equals(leader)) {
          return false; // Leaders don't match
        }
      }
      return commonLeader != null;
    });
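
The convergence check itself can be unit-tested without a running cluster if it is written as a pure function over the leader names each server reports. The sketch below uses only the JDK. LeaderConvergence, allAgree, and awaitCondition are hypothetical names, and the small polling loop is a simplified stand-in for Awaitility, not a replacement for it. A server that reports no leader is treated as not yet converged.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical, cluster-independent version of the stabilization check. */
final class LeaderConvergence {
  /** True only if the list is non-empty, contains no null, and all names are equal. */
  static boolean allAgree(final List<String> leaders) {
    if (leaders.isEmpty()) {
      return false;
    }
    final String first = leaders.get(0);
    if (first == null) {
      return false; // first server does not know a leader yet
    }
    for (final String leader : leaders) {
      if (!first.equals(leader)) {
        return false; // null or a different leader name
      }
    }
    return true;
  }

  /** Polls the condition every pollMs until it holds or timeoutMs elapses. */
  static boolean awaitCondition(final BooleanSupplier condition, final long timeoutMs, final long pollMs) {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out without the condition holding
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

In the integration test, the list passed to allAgree would be built by collecting getLeaderName() from each server on every poll, so a fixed sleep is replaced by a check of the actual cluster state.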

Validation

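# Run the test 20 times and count failed runs (races rarely show up in a single run)
failures=0
for i in {1..20}; do
  echo "Run $i/20"
  mvn test -pl server -Dtest=HASplitBrainIT || { echo "FAILED: Run $i"; failures=$((failures + 1)); }
done
echo "Failed runs: $failures/20"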

Success Criteria

  • No race-related failures observed in 20 consecutive runs
  • Split triggers exactly once per test
  • Proper leader election after rejoin
  • All synchronization points implemented
  • Thread-safe state management
  • Test passes at least 19/20 times (95% success rate)

Expected Impact

Before:

  • Race conditions in leader tracking
  • Split can trigger multiple times
  • No verification of cluster state after rejoin
  • Silent failures

After:

  • Thread-safe state management
  • Idempotent split trigger
  • Verified cluster convergence
  • Clear success criteria

Time Estimate

45 minutes

Risk Level

MEDIUM - Adds synchronization, requires careful review

Documentation

See PORTING_PLAN_IT_TEST_IMPROVEMENTS.md - Phase 4, Section 4.2 for detailed instructions.
See HA_TEST_RELIABILITY_ANALYSIS.md - Section on HASplitBrainIT for analysis.

Related Issues

Part of Epic: #2968
Can be done in parallel with: HARandomCrashIT, ReplicationChangeSchemaIT improvements

Labels

enhancement