Improve HARandomCrashIT reliability with Awaitility and exponential backoff #2971

## Overview

Improve the reliability of the HARandomCrashIT chaos engineering test by replacing busy-wait loops with Awaitility timeouts, validating server selection, and adding exponential backoff.

Part of Epic #2968

## Current Issues

- Busy-wait loops can hang indefinitely
- No validation that selected server is actually running
- No verification that restart succeeded
- No wait for replica reconnection after restart
- Fixed delays don't adapt to failures

## Improvements to Implement

### 1. Make Timer Daemon Thread (Prevents JVM Hangs)
```java
// Before
new Timer("HARandomCrashIT-Timer")

// After
new Timer("HARandomCrashIT-Timer", true)  // daemon=true
```
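
To illustrate why the second constructor argument matters, here is a minimal standalone sketch (not taken from the test itself) that observes whether the timer's worker thread is a daemon. The class and method names are hypothetical; only `java.util.Timer` and its `(String, boolean)` constructor come from the JDK. A non-daemon timer thread keeps the JVM alive until the timer is cancelled, which is the hang this change is meant to avoid.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class, not part of HARandomCrashIT.
final class DaemonTimerSketch {

    /** Runs one task on a new Timer and reports whether its thread is a daemon. */
    static boolean timerThreadIsDaemon(boolean daemon) {
        Timer timer = new Timer("DaemonTimerSketch-Timer", daemon);
        AtomicBoolean observed = new AtomicBoolean();
        CountDownLatch done = new CountDownLatch(1);
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                observed.set(Thread.currentThread().isDaemon());
                done.countDown();
            }
        }, 0);
        try {
            // Bounded wait so the sketch itself cannot hang.
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timer task did not run in time");
            }
            return observed.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for timer task", e);
        } finally {
            // Always cancel so a non-daemon timer thread does not outlive the call.
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println("daemon=true  -> " + timerThreadIsDaemon(true));
        System.out.println("daemon=false -> " + timerThreadIsDaemon(false));
    }
}
```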

### 2. Server Selection Validation
```java
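// Inside the crash-scheduling loop: pick a random server, then check its state
int serverId = random.nextInt(getTotalServers());
if (getServer(serverId).getStatus() != ArcadeDBServer.STATUS.ONLINE) {
  continue; // Skip servers that are not ONLINE and try again on the next iteration
}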
```
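
One caveat with a bare `continue` is that the selection can spin if no server is online. The following standalone sketch shows one way to bound the number of attempts. The `Status` enum, the class name, and the `maxAttempts` parameter are assumptions made for illustration and are not ArcadeDB APIs.

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.Random;

// Hypothetical sketch; Status stands in for ArcadeDBServer.STATUS.
final class ServerSelectionSketch {

    enum Status { OFFLINE, STARTING, ONLINE, SHUTTING_DOWN }

    /**
     * Picks a random index whose status is ONLINE, giving up after maxAttempts
     * random draws. Returns an empty OptionalInt if none was found.
     */
    static OptionalInt pickOnlineServer(List<Status> statuses, Random random, int maxAttempts) {
        if (statuses.isEmpty()) {
            return OptionalInt.empty(); // Random.nextInt(0) would throw
        }
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int candidate = random.nextInt(statuses.size());
            if (statuses.get(candidate) == Status.ONLINE) {
                return OptionalInt.of(candidate);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        List<Status> cluster = List.of(Status.OFFLINE, Status.ONLINE, Status.SHUTTING_DOWN);
        OptionalInt picked = pickOnlineServer(cluster, new Random(), 10);
        System.out.println(picked.isPresent() ? "Picked server " + picked.getAsInt() : "No online server found");
    }
}
```

Returning an empty result lets the caller skip the crash for this cycle instead of looping forever.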

### 3. Replace Shutdown Busy-Wait with Awaitility
```java
// Before: can hang indefinitely
while (getServer(serverId).getStatus() == ArcadeDBServer.STATUS.SHUTTING_DOWN)
  CodeUtils.sleep(300);

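// After: bounded by a 30s timeout; Awaitility throws ConditionTimeoutException
// if the server is still SHUTTING_DOWN when the timeout expires.
// finalServerId is an effectively final copy of serverId so the lambda can capture it.
Awaitility.await()
    .atMost(30, TimeUnit.SECONDS)
    .pollInterval(300, TimeUnit.MILLISECONDS)
    .until(() -> getServer(finalServerId).getStatus() != ArcadeDBServer.STATUS.SHUTTING_DOWN);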
```

### 4. Restart Verification with Retries
```java
boolean restartSuccess = false;
for (int retry = 0; retry < 3 && !restartSuccess; retry++) {
  try {
    startServer(finalServerId);
    restartSuccess = true;
    LogManager.instance().log(this, Level.INFO, "Server %d restarted successfully", finalServerId);
  } catch (Exception e) {
    LogManager.instance().log(this, Level.WARNING, "Failed to restart server %d (attempt %d)", e, finalServerId, retry + 1);
    CodeUtils.sleep(1_000);
  }
}
```
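
The retry loop above can also be factored into a small reusable helper. This is a sketch under assumptions: `RetrySketch`, `ThrowingAction`, and `runWithRetries` are hypothetical names, logging goes to `System.err` rather than `LogManager`, and unlike the loop above it does not pause after the final failed attempt.

```java
// Hypothetical helper, not part of ArcadeDB.
final class RetrySketch {

    /** An action that may throw a checked exception, such as starting a server. */
    @FunctionalInterface
    interface ThrowingAction {
        void run() throws Exception;
    }

    /**
     * Runs the action up to maxAttempts times, pausing pauseMillis between
     * failed attempts. Returns true as soon as one attempt succeeds.
     */
    static boolean runWithRetries(ThrowingAction action, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return true;
            } catch (Exception e) {
                System.err.printf("Attempt %d/%d failed: %s%n", attempt, maxAttempts, e.getMessage());
                if (attempt == maxAttempts) {
                    break; // no point sleeping after the last attempt
                }
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated restart that fails twice before succeeding.
        boolean ok = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("server not ready");
            }
        }, 3, 100);
        System.out.println("restartSuccess=" + ok + " after " + calls[0] + " attempts");
    }
}
```

In the test, the call site would then look roughly like `runWithRetries(() -> startServer(finalServerId), 3, 1_000)`.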

### 5. Replica Reconnection Wait
```java
if (restartSuccess) {
  Awaitility.await()
      .atMost(30, TimeUnit.SECONDS)
      .pollInterval(500, TimeUnit.MILLISECONDS)
      .until(() -> {
        try {
          return getServer(finalServerId).getHA().getReplicaConnections().size() > 0;
        } catch (Exception e) {
          return false;
        }
      });
}
```

### 6. Exponential Backoff for Client Operations
```java
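// Adaptive delay based on consecutive failures, capped at 5s.
// Note: as written this grows linearly (1s, 2s, 3s, ...); a strictly
// exponential backoff would double the delay after each failure.
delay = Math.min(1_000 * (consecutiveFailures + 1), 5_000);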
```
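
If a doubling schedule is what is intended, a capped exponential variant could look like the sketch below. The class name and constants are hypothetical; the 1s base and 5s cap are carried over from the snippet above.

```java
// Hypothetical sketch of capped exponential backoff.
final class BackoffSketch {

    static final long BASE_DELAY_MS = 1_000;
    static final long MAX_DELAY_MS = 5_000;

    /** Returns BASE_DELAY_MS * 2^consecutiveFailures, capped at MAX_DELAY_MS. */
    static long delayMillis(int consecutiveFailures) {
        if (consecutiveFailures < 0) {
            throw new IllegalArgumentException("consecutiveFailures must be >= 0");
        }
        // Clamp the shift so a large failure count cannot overflow the long.
        int shift = Math.min(consecutiveFailures, 16);
        return Math.min(BASE_DELAY_MS << shift, MAX_DELAY_MS);
    }

    public static void main(String[] args) {
        // Prints 1000, 2000, 4000, 5000, 5000 for 0..4 failures.
        for (int failures = 0; failures <= 4; failures++) {
            System.out.println(failures + " failures -> " + delayMillis(failures) + " ms");
        }
    }
}
```

With a 5s cap the two schedules differ only slightly (1s, 2s, 4s, 5s versus 1s, 2s, 3s, 4s, 5s), so the choice here is mostly about matching the name to the behavior.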

## Validation

```bash
# Run test 20 times to verify reliability
for i in {1..20}; do
  echo "Run $i/20"
  mvn test -pl server -Dtest=HARandomCrashIT || echo "FAILED: Run $i"
done
```

## Success Criteria

- [ ] Test passes at least 19/20 times (95% success rate)
- [ ] No infinite loops or hangs
- [ ] Proper timeout on all waits
- [ ] Clean shutdown on test failure
- [ ] All improvements implemented
- [ ] Awaitility imports added

## Expected Impact

**Before:**
- Flakiness: ~15-20%
- Hang risk: Present
- Timeout coverage: 0%

**After:**
- Flakiness: <5% (target <1%)
- Hang risk: Eliminated
- Timeout coverage: 100%

## Time Estimate

**60 minutes**

## Risk Level

**MEDIUM** - Changes test behavior but improves reliability

## Documentation

See `PORTING_PLAN_IT_TEST_IMPROVEMENTS.md` - Phase 4, Section 4.1 for detailed instructions.
See `HA_TEST_RELIABILITY_ANALYSIS.md` - Section on HARandomCrashIT for analysis.

## Related Issues

Part of Epic: #2968
Can be done in parallel with: HASplitBrainIT, ReplicationChangeSchemaIT improvements
