Improve HARandomCrashIT reliability with Awaitility and exponential backoff #2971

@robfrank

Overview

Improve reliability of HARandomCrashIT chaos engineering test by replacing busy-wait loops with Awaitility timeouts, adding server selection validation, and implementing exponential backoff.

Part of Epic #2968

Current Issues

  • Busy-wait loops can hang indefinitely
  • No validation that selected server is actually running
  • No verification that restart succeeded
  • No wait for replica reconnection after restart
  • Fixed delays don't adapt to failures

Improvements to Implement

1. Make Timer Daemon Thread (Prevents JVM Hangs)

// Before
new Timer("HARandomCrashIT-Timer")

// After
new Timer("HARandomCrashIT-Timer", true)  // daemon=true
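To illustrate why the second constructor argument matters, here is a minimal, self-contained sketch using only the JDK's `java.util.Timer`. The class and method names (`DaemonTimerDemo`, `timerThreadIsDaemon`) are hypothetical and are not part of HARandomCrashIT. It reports whether the timer's worker thread is a daemon; a non-daemon worker keeps the JVM alive until the timer is cancelled.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of the ArcadeDB test suite.
final class DaemonTimerDemo {

  // Schedules one task and reports whether the timer's worker thread is a daemon.
  static boolean timerThreadIsDaemon(boolean daemon) {
    Timer timer = new Timer("demo-timer", daemon);
    CompletableFuture<Boolean> result = new CompletableFuture<>();
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        // Runs on the timer's worker thread.
        result.complete(Thread.currentThread().isDaemon());
      }
    }, 0);
    try {
      return result.get(5, TimeUnit.SECONDS);
    } catch (Exception e) {
      throw new IllegalStateException("Timer task did not run", e);
    } finally {
      timer.cancel(); // always release the worker thread
    }
  }

  public static void main(String[] args) {
    System.out.println("daemon=true  -> " + timerThreadIsDaemon(true));
    System.out.println("daemon=false -> " + timerThreadIsDaemon(false));
  }
}
```

Even with a daemon timer, calling `cancel()` in test teardown is still a reasonable habit, since it stops pending tasks from firing against servers that are already being shut down.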

2. Server Selection Validation

int serverId = random.nextInt(getTotalServers());
if (getServer(serverId).getStatus() != ArcadeDBServer.STATUS.ONLINE) {
  continue; // Skip offline servers
}
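The `continue` above assumes an enclosing loop. One way to keep that loop from spinning forever when no server is online is to bound the number of picks. The sketch below is an assumption about how this could look, not the test's actual code: `ServerPicker`, its `Status` enum, and `pickOnline` are hypothetical stand-ins for `getServer(...)` and `ArcadeDBServer.STATUS`.

```java
import java.util.OptionalInt;
import java.util.Random;

// Hypothetical helper, shown only to illustrate bounded server selection.
final class ServerPicker {

  // Stand-in for ArcadeDBServer.STATUS (assumed values, for illustration only).
  enum Status { OFFLINE, STARTING, ONLINE, SHUTTING_DOWN }

  // Picks a random ONLINE server index, giving up after maxAttempts picks.
  static OptionalInt pickOnline(Status[] servers, Random random, int maxAttempts) {
    if (servers.length == 0)
      return OptionalInt.empty();
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      int candidate = random.nextInt(servers.length);
      if (servers[candidate] == Status.ONLINE)
        return OptionalInt.of(candidate);
      // Otherwise skip this server and pick again.
    }
    return OptionalInt.empty(); // nothing online within the attempt budget
  }
}
```

Returning an empty `OptionalInt` lets the caller skip this crash cycle instead of blocking the timer task.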

3. Replace Shutdown Busy-Wait with Awaitility

// Before: can hang indefinitely
while (getServer(serverId).getStatus() == ArcadeDBServer.STATUS.SHUTTING_DOWN)
  CodeUtils.sleep(300);

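// After: bounded by a 30s timeout (Awaitility throws ConditionTimeoutException when exceeded)
final int finalServerId = serverId; // lambdas can only capture effectively final locals
Awaitility.await()
    .atMost(30, TimeUnit.SECONDS)
    .pollInterval(300, TimeUnit.MILLISECONDS)
    .until(() -> getServer(finalServerId).getStatus() != ArcadeDBServer.STATUS.SHUTTING_DOWN);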
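For readers unfamiliar with Awaitility, the difference from the busy-wait is simply that the poll loop has a deadline. The following JDK-only sketch shows that idea; it is not Awaitility's implementation, and `BoundedWait.waitUntil` is a hypothetical name. Unlike Awaitility, which throws on timeout, this version returns `false`.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical JDK-only stand-in that mimics a bounded poll.
final class BoundedWait {

  // Polls the condition until it holds or the timeout elapses.
  // Returns true if the condition was met, false on timeout or interruption.
  static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0)
        return false; // deadline passed: give up instead of hanging
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

In the test itself Awaitility remains the better fit, because a timeout surfaces as a test failure with a descriptive message rather than a boolean that the caller could ignore.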

4. Restart Verification with Retries

boolean restartSuccess = false;
for (int retry = 0; retry < 3 && !restartSuccess; retry++) {
  try {
    startServer(finalServerId);
    restartSuccess = true;
    LogManager.instance().log(this, Level.INFO, "Server %d restarted successfully", finalServerId);
  } catch (Exception e) {
    LogManager.instance().log(this, Level.WARNING, "Failed to restart server %d (attempt %d)", e, finalServerId, retry + 1);
    CodeUtils.sleep(1_000);
  }
}
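The retry loop above can also be factored into a small helper. The sketch below is one possible shape under stated assumptions: `BoundedRetry.retry` is a hypothetical name, the attempt reports success as a boolean (so it could combine a start call with a status check), and the helper skips the pause after the final failed attempt.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating a bounded retry with a fixed pause.
final class BoundedRetry {

  // Runs the attempt up to maxAttempts times.
  // Returns the 1-based attempt number that succeeded, or -1 if all failed.
  static int retry(BooleanSupplier attempt, int maxAttempts, long delayMs) {
    for (int n = 1; n <= maxAttempts; n++) {
      try {
        if (attempt.getAsBoolean())
          return n;
      } catch (RuntimeException e) {
        // An exception counts as a failed attempt; fall through to the pause.
      }
      if (n < maxAttempts) { // no pause after the last attempt
        try {
          Thread.sleep(delayMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return -1;
        }
      }
    }
    return -1;
  }
}
```

A caller would treat `-1` the same way the snippet above treats `restartSuccess == false`, for example by skipping the replica reconnection wait.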

5. Replica Reconnection Wait

if (restartSuccess) {
  Awaitility.await()
      .atMost(30, TimeUnit.SECONDS)
      .pollInterval(500, TimeUnit.MILLISECONDS)
      .until(() -> {
        try {
          return getServer(finalServerId).getHA().getReplicaConnections().size() > 0;
        } catch (Exception e) {
          return false;
        }
      });
}

6. Exponential Backoff for Client Operations

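// Adaptive delay based on consecutive failures.
// Note: as written this grows linearly (1s, 2s, 3s, ...) and is capped at 5s;
// true exponential backoff would double the delay per failure (1s, 2s, 4s, ...).
delay = Math.min(1_000 * (consecutiveFailures + 1), 5_000);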
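To make the distinction between linear and exponential growth concrete, here is a small sketch comparing the two schedules with the same 1s base and 5s cap. The class and method names are hypothetical, and the exponent clamp is an added safeguard against shift overflow rather than something the issue specifies.

```java
// Hypothetical helper comparing capped linear and capped exponential delays.
final class Backoff {
  private static final long BASE_DELAY_MS = 1_000;
  private static final long MAX_DELAY_MS = 5_000;

  // Linear schedule, equivalent to the formula in this issue: 1s, 2s, 3s, 4s, 5s, 5s, ...
  static long linearDelayMs(int consecutiveFailures) {
    return Math.min(BASE_DELAY_MS * (consecutiveFailures + 1L), MAX_DELAY_MS);
  }

  // Exponential schedule: 1s, 2s, 4s, then capped at 5s.
  static long exponentialDelayMs(int consecutiveFailures) {
    // Clamp the exponent so the left shift can never overflow a long.
    int exponent = Math.min(Math.max(consecutiveFailures, 0), 20);
    return Math.min(BASE_DELAY_MS << exponent, MAX_DELAY_MS);
  }

  public static void main(String[] args) {
    for (int failures = 0; failures <= 4; failures++) {
      System.out.println(failures + " failures: linear=" + linearDelayMs(failures)
          + "ms exponential=" + exponentialDelayMs(failures) + "ms");
    }
  }
}
```

With a cap as low as 5s the two schedules differ only for the first few failures, so either may be adequate here; the choice mostly affects how quickly clients back off right after a crash.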

Validation

# Run test 20 times to verify reliability
for i in {1..20}; do
  echo "Run $i/20"
  mvn test -pl server -Dtest=HARandomCrashIT || echo "FAILED: Run $i"
done

Success Criteria

  • Test passes at least 19/20 times (95% success rate)
  • No infinite loops or hangs
  • Proper timeout on all waits
  • Clean shutdown on test failure
  • All improvements implemented
  • Awaitility imports added

Expected Impact

Before:

  • Flakiness: ~15-20%
  • Hang risk: Present
  • Timeout coverage: 0%

After:

  • Flakiness: <5% (target <1%)
  • Hang risk: Eliminated
  • Timeout coverage: 100%

Time Estimate

60 minutes

Risk Level

MEDIUM - Changes test behavior but improves reliability

Documentation

See PORTING_PLAN_IT_TEST_IMPROVEMENTS.md - Phase 4, Section 4.1 for detailed instructions.
See HA_TEST_RELIABILITY_ANALYSIS.md - Section on HARandomCrashIT for analysis.

Related Issues

Part of Epic: #2968
Can be done in parallel with: HASplitBrainIT, ReplicationChangeSchemaIT improvements

Metadata

Labels: enhancement