Add thread safety and cluster stabilization to HASplitBrainIT #2972

## Overview

Make HASplitBrainIT reliable by adding thread-safe state management, a double-checked-locking guard on the split trigger, and a cluster stabilization wait after rejoin, so that race conditions no longer cause intermittent failures in split-brain testing.

Part of Epic #2968

## Current Issues

- Shared state accessed without synchronization (race conditions)
- Split can trigger multiple times
- No wait for cluster stabilization after rejoin
- Fixed delays don't verify actual cluster state

## Improvements to Implement

### 1. Thread-Safe State Management
```java
// Add volatile fields
private volatile String firstLeader = null;
private volatile boolean split = false;
private volatile boolean rejoining = false;
```

### 2. Synchronized Leader Tracking
```java
synchronized (HASplitBrainIT.this) {
  if (firstLeader == null) {
    firstLeader = leader;
    LogManager.instance().log(this, Level.INFO, "First leader detected: %s", leader);
  }
}
```
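
As an alternative to the `synchronized` block above, the "record only the first leader" rule can be expressed with a single atomic compare-and-set. The following is a minimal sketch, not the test's actual code: `FirstLeaderTracker` and `recordIfFirst` are hypothetical names introduced for illustration, and the sketch assumes a `null` leader name means "no leader known yet".

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper (not part of HASplitBrainIT): remembers only the first leader seen.
final class FirstLeaderTracker {
  private final AtomicReference<String> firstLeader = new AtomicReference<>();

  /** Returns true only for the single call that actually recorded the first leader. */
  boolean recordIfFirst(final String leader) {
    if (leader == null)
      return false; // Assumption: null means no leader has been elected yet
    // Atomically set the value only if nothing was recorded before
    return firstLeader.compareAndSet(null, leader);
  }

  String get() {
    return firstLeader.get();
  }

  public static void main(final String[] args) {
    final FirstLeaderTracker tracker = new FirstLeaderTracker();
    tracker.recordIfFirst("server-0");
    tracker.recordIfFirst("server-1"); // Ignored: a first leader is already recorded
    System.out.println("First leader: " + tracker.get());
  }
}
```

The caller can log the "First leader detected" message only when `recordIfFirst` returns `true`, which keeps the log line to one occurrence without holding a lock.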

### 3. Double-Checked Locking for Split Trigger (Idempotent)
```java
if (messagesSent >= 20 && !split) {
  synchronized (HASplitBrainIT.this) {
    if (split) {
      return; // Another thread already triggered the split
    }
    split = true;
    LogManager.instance().log(this, Level.INFO, "Triggering network split after %d messages", messagesSent);
    // ... split logic
  }
}
```
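
The same "exactly once" guarantee can also be sketched with `AtomicBoolean.compareAndSet` instead of double-checked locking. This is an illustrative alternative under stated assumptions, not the implementation proposed in this issue: `OneShotTrigger`, `tryFire`, and `countFirings` are hypothetical names, and the threshold of 20 mirrors the snippet above. One behavioral difference is worth noting: with the `synchronized` version, late threads block until the split logic finishes, whereas with compare-and-set they skip immediately.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not part of HASplitBrainIT): runs an action at most once.
final class OneShotTrigger {
  private final AtomicBoolean fired = new AtomicBoolean(false);
  private final long threshold;

  OneShotTrigger(final long threshold) {
    this.threshold = threshold;
  }

  /** Runs the action only for the first caller that reaches the threshold. */
  boolean tryFire(final long messagesSent, final Runnable action) {
    if (messagesSent < threshold)
      return false;
    if (!fired.compareAndSet(false, true))
      return false; // Another thread already triggered the action
    action.run();
    return true;
  }

  /** Races several threads against one trigger and returns how often the action ran. */
  static int countFirings(final int threads) {
    final OneShotTrigger trigger = new OneShotTrigger(20);
    final AtomicInteger firings = new AtomicInteger();
    final CountDownLatch start = new CountDownLatch(1);
    final ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await(); // Hold all threads so they race on tryFire together
        } catch (final InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        trigger.tryFire(25, firings::incrementAndGet);
      });
    }
    start.countDown();
    pool.shutdown();
    try {
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return firings.get();
  }

  public static void main(final String[] args) {
    System.out.println("Firings: " + countFirings(8));
  }
}
```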

### 4. Increase Split Duration (Better Quorum Establishment)
```java
// Before: 10 seconds
timer.schedule(new TimerTask() { ... }, 10_000L);

// After: 15 seconds (allows quorum in both partitions)
timer.schedule(new TimerTask() { ... }, 15_000L);
```

### 5. Increase Message Threshold for Stability
```java
// Before: 10 messages
if (messagesSent >= 10 && !split)

// After: 20 messages (ensures cluster is stable)
if (messagesSent >= 20 && !split)
```

### 6. Cluster Stabilization Wait After Rejoin
```java
Awaitility.await("cluster stabilization")
    .atMost(60, TimeUnit.SECONDS)
    .pollInterval(500, TimeUnit.MILLISECONDS)
    .until(() -> {
      // Verify every server reports the same, non-null leader
      String commonLeader = null;
      for (int i = 0; i < getTotalServers(); i++) {
        String leader = getServer(i).getHA().getLeaderName();
        if (leader == null) {
          return false; // This server has not elected a leader yet
        }
        if (commonLeader == null) {
          commonLeader = leader;
        } else if (!commonLeader.equals(leader)) {
          return false; // Leaders don't match
        }
      }
      return commonLeader != null;
    });
```
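
The convergence condition polled above can be factored into a pure function, which makes the edge cases easy to unit-test without a running cluster. This is a sketch under assumptions: `LeaderAgreement` and `allAgree` are hypothetical names, the input is the list of leader names reported by each server, and a `null` entry is assumed to mean that the server has no leader yet.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of HASplitBrainIT): checks leader agreement.
final class LeaderAgreement {
  /** Returns true only if every server reports the same, non-null leader. */
  static boolean allAgree(final List<String> leaders) {
    if (leaders.isEmpty())
      return false; // No servers means nothing has converged
    final String first = leaders.get(0);
    if (first == null)
      return false; // Assumption: null means this server has no leader yet
    for (final String leader : leaders) {
      if (!first.equals(leader))
        return false; // A null or different leader breaks agreement
    }
    return true;
  }

  public static void main(final String[] args) {
    System.out.println(allAgree(Arrays.asList("s0", "s0", "s0")));
    System.out.println(allAgree(Arrays.asList(null, "s0", "s0")));
  }
}
```

Inside the `until(...)` callback, the test would collect each server's leader name into a list and return the result of `allAgree`.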

## Validation

```bash
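# Run the test 20 times and count failures to expose intermittent races
failures=0
for i in {1..20}; do
  echo "Run $i/20"
  mvn test -pl server -Dtest=HASplitBrainIT || { echo "FAILED: Run $i"; failures=$((failures + 1)); }
done
echo "Failed runs: $failures/20"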
```

## Success Criteria

- [ ] No race-related failures observed across 20 consecutive runs
- [ ] Split triggers exactly once per test
- [ ] Proper leader election after rejoin
- [ ] All synchronization points implemented
- [ ] Thread-safe state management
- [ ] Test passes at least 19/20 times (95% success rate)

## Expected Impact

**Before:**
- Race conditions in leader tracking
- Split can trigger multiple times
- No verification of cluster state after rejoin
- Silent failures

**After:**
- Thread-safe state management
- Idempotent split trigger
- Verified cluster convergence
- Clear success criteria

## Time Estimate

**45 minutes**

## Risk Level

**MEDIUM** - Adds synchronization, requires careful review

## Documentation

- See `PORTING_PLAN_IT_TEST_IMPROVEMENTS.md`, Phase 4, Section 4.2, for detailed instructions.
- See `HA_TEST_RELIABILITY_ANALYSIS.md`, section on HASplitBrainIT, for the analysis.

## Related Issues

- Part of Epic: #2968
- Can be done in parallel with: HARandomCrashIT, ReplicationChangeSchemaIT improvements
