feat(k8s/cluster): add retry logic with exponential backoff to EKS provider #11

Open
rafeegnash wants to merge 1 commit into k8-gcp-support from issue-71-eks-retry-logic

@rafeegnash
Summary

  • Adds retry logic with exponential backoff to runAWS and runEksctl methods
  • Matches the pattern already implemented in GKE provider for consistency
  • Improves resilience for transient AWS API errors

Changes

  • Add isRetryableError method to detect retryable AWS errors
  • Add errorHint method to provide helpful guidance for common errors
  • Update runAWS with retry loop using exponential backoff (200ms, 500ms, 1200ms)
  • Update runEksctl with same retry pattern
  • Add AWS CLI presence check before running commands

Retryable Error Categories

  • Throttling and rate limit errors
  • Timeout and deadline exceeded errors
  • Service unavailable and internal errors
  • Connection reset/refused errors
  • AWS-specific transient errors (RequestLimitExceeded, ProvisionedThroughputExceeded)

Test Plan

  • All existing cluster tests pass
  • New tests for isRetryableError covering 19 scenarios
  • New tests for errorHint covering 16 error types
  • Code formatted with gofmt

Closes bgdnvk#71

…ovider

Add retry logic to runAWS and runEksctl methods matching the pattern in
GKE provider. This improves resilience for transient AWS API errors.

Changes:
- Add isRetryableError method to detect retryable AWS errors (throttling,
  timeouts, service unavailable, connection issues)
- Add errorHint method to provide helpful guidance for common errors
- Update runAWS with retry loop using exponential backoff (200ms, 500ms, 1200ms)
- Update runEksctl with same retry pattern
- Add AWS CLI presence check before running commands
- Add comprehensive tests for both isRetryableError and errorHint

Retryable error categories:
- Throttling and rate limit errors
- Timeout and deadline exceeded errors
- Service unavailable and internal errors
- Connection reset/refused errors
- AWS-specific transient errors (RequestLimitExceeded, etc.)

Refs bgdnvk#71

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>