Closed
Labels
enhancement: New feature or request
Description
Overview
Improve reliability of HARandomCrashIT chaos engineering test by replacing busy-wait loops with Awaitility timeouts, adding server selection validation, and implementing exponential backoff.
Part of Epic #2968
Current Issues
- Busy-wait loops can hang indefinitely
- No validation that selected server is actually running
- No verification that restart succeeded
- No wait for replica reconnection after restart
- Fixed delays don't adapt to failures
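The first and last issues above can be illustrated without any test library. The following is a minimal sketch, not the code from HARandomCrashIT: `BoundedWaitSketch`, `waitUntil`, `linearDelayMillis` and `exponentialDelayMillis` are hypothetical names introduced here. `waitUntil` shows the idea behind a timeout-bounded poll (what Awaitility's `atMost`/`pollInterval` provide in the real test), and the two delay helpers contrast the `Math.min(1_000 * (consecutiveFailures + 1), 5_000)` formula quoted in this issue, which grows linearly, with a doubling variant, assuming the same 1s base and 5s cap.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class BoundedWaitSketch {

  // Polls the condition until it holds or the timeout elapses.
  // Returns true if the condition held in time, false on timeout or interruption,
  // so the caller can fail the test instead of hanging.
  static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      final long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0)
        return false; // deadline reached: give up rather than loop forever
      try {
        // Never sleep past the deadline.
        Thread.sleep(Math.min(pollInterval.toMillis(), Math.max(1, remainingNanos / 1_000_000)));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return false;
      }
    }
    return true;
  }

  // The formula quoted in this issue: 1s, 2s, 3s, ... capped at 5s (linear growth).
  static long linearDelayMillis(int consecutiveFailures) {
    return Math.min(1_000L * (consecutiveFailures + 1), 5_000L);
  }

  // Assumed doubling variant: 1s, 2s, 4s, ... capped at 5s.
  static long exponentialDelayMillis(int consecutiveFailures) {
    final int shift = Math.min(Math.max(consecutiveFailures, 0), 10); // bound the shift
    return Math.min(1_000L << shift, 5_000L);
  }

  public static void main(String[] args) {
    final long start = System.currentTimeMillis();
    // Condition becomes true after roughly 200 ms, well inside the 2 s budget.
    final boolean met = waitUntil(() -> System.currentTimeMillis() - start >= 200,
        Duration.ofSeconds(2), Duration.ofMillis(50));
    System.out.println("condition met: " + met);
    for (int failures = 0; failures < 4; failures++)
      System.out.println(failures + " failures -> linear " + linearDelayMillis(failures)
          + " ms, exponential " + exponentialDelayMillis(failures) + " ms");
  }
}
```

Returning `false` on timeout is a design choice for this sketch; Awaitility instead throws `ConditionTimeoutException`, which fails the test directly.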
Improvements to Implement
1. Make Timer Daemon Thread (Prevents JVM Hangs)
// Before
new Timer("HARandomCrashIT-Timer")
// After
new Timer("HARandomCrashIT-Timer", true) // daemon=true
2. Server Selection Validation
int serverId = random.nextInt(getTotalServers());
if (getServer(serverId).getStatus() != ArcadeDBServer.STATUS.ONLINE) {
  continue; // Skip offline servers
}
3. Replace Shutdown Busy-Wait with Awaitility
// Before: can hang indefinitely
while (getServer(serverId).getStatus() == ArcadeDBServer.STATUS.SHUTTING_DOWN)
  CodeUtils.sleep(300);
// After: 30s timeout; Awaitility throws ConditionTimeoutException instead of hanging
Awaitility.await()
    .atMost(30, TimeUnit.SECONDS)
    .pollInterval(300, TimeUnit.MILLISECONDS)
    .until(() -> getServer(finalServerId).getStatus() != ArcadeDBServer.STATUS.SHUTTING_DOWN);
4. Restart Verification with Retries
boolean restartSuccess = false;
for (int retry = 0; retry < 3 && !restartSuccess; retry++) {
  try {
    startServer(finalServerId);
    restartSuccess = true;
    LogManager.instance().log(this, Level.INFO, "Server %d restarted successfully", finalServerId);
  } catch (Exception e) {
    LogManager.instance().log(this, Level.WARNING, "Failed to restart server %d (attempt %d)", e, finalServerId, retry + 1);
    CodeUtils.sleep(1_000);
  }
}
5. Replica Reconnection Wait
if (restartSuccess) {
  Awaitility.await()
      .atMost(30, TimeUnit.SECONDS)
      .pollInterval(500, TimeUnit.MILLISECONDS)
      .until(() -> {
        try {
          return getServer(finalServerId).getHA().getReplicaConnections().size() > 0;
        } catch (Exception e) {
          return false;
        }
      });
}
6. Exponential Backoff for Client Operations
// Adaptive delay based on consecutive failures: 1s, 2s, 3s, ... capped at 5s (linear growth)
delay = Math.min(1_000 * (consecutiveFailures + 1), 5_000);
Validation
# Run test 20 times to verify reliability
for i in {1..20}; do
  echo "Run $i/20"
  mvn test -pl server -Dtest=HARandomCrashIT || echo "FAILED: Run $i"
done
Success Criteria
- Test passes at least 19/20 times (95% success rate)
- No infinite loops or hangs
- Proper timeout on all waits
- Clean shutdown on test failure
- All improvements implemented
- Awaitility imports added
Expected Impact
Before:
- Flakiness: ~15-20%
- Hang risk: Present
- Timeout coverage: 0%
After:
- Flakiness: <5% (target <1%)
- Hang risk: Eliminated
- Timeout coverage: 100%
Time Estimate
60 minutes
Risk Level
MEDIUM - Changes test behavior but improves reliability
Documentation
See PORTING_PLAN_IT_TEST_IMPROVEMENTS.md - Phase 4, Section 4.1 for detailed instructions.
See HA_TEST_RELIABILITY_ANALYSIS.md - Section on HARandomCrashIT for analysis.
Related Issues
Part of Epic: #2968
Can be done in parallel with: HASplitBrainIT, ReplicationChangeSchemaIT improvements