Fix MinIO health probe timeouts causing CrashLoopBackOff #609
The default 1s probe timeout is too aggressive for MinIO on EBS-backed storage. During startup, MinIO walks the /data filesystem to build its metadata index, starving the health endpoint of CPU. This caused repeated liveness probe failures (context deadline exceeded), SIGKILL (exit 137), and CrashLoopBackOff with 17+ restarts on vteam-uat.

Changes:
- livenessProbe: initialDelaySeconds 30→60, timeoutSeconds 1→5, periodSeconds 10→15
- readinessProbe: timeoutSeconds 1→5, periodSeconds 5→10
- Explicit failureThreshold: 3 on both probes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
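
For illustration, the probe settings listed in the commit message could look roughly like this in the MinIO container spec. This is a sketch, not the actual diff: the `httpGet` handler, port 9000 (MinIO's default API port), and the `/minio/health/ready` readiness path are assumptions, since this page only names the `/minio/health/live` endpoint. Any readiness `initialDelaySeconds` is omitted because the commit message does not mention one.

```yaml
# Sketch of the MinIO container probes after this change.
# Assumptions (not confirmed by the PR text): httpGet handler, port 9000,
# and the readiness path.
livenessProbe:
  httpGet:
    path: /minio/health/live    # endpoint named in the PR's root-cause section
    port: 9000                  # assumption: MinIO default API port
  initialDelaySeconds: 60       # was 30
  timeoutSeconds: 5             # was 1
  periodSeconds: 15             # was 10
  failureThreshold: 3           # now explicit; matches the Kubernetes default
readinessProbe:
  httpGet:
    path: /minio/health/ready   # assumption
    port: 9000                  # assumption
  timeoutSeconds: 5             # was 1
  periodSeconds: 10             # was 5
  failureThreshold: 3           # now explicit; matches the Kubernetes default
```

With these values, the kubelet restarts the container only after roughly failureThreshold × periodSeconds = 45s of consecutive liveness failures once probing begins, compared with roughly 30s under the old settings, and each probe has 5s rather than 1s to respond.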
Claude Code Review

Summary

This PR fixes a production issue where MinIO pods enter CrashLoopBackOff due to overly aggressive health probe timeouts. The changes increase probe timeouts and intervals to accommodate slower I/O during startup on EBS-backed storage. The fix is targeted, well-documented, and production-tested.

Recommendation: ✅ APPROVE with one minor enhancement suggestion

Issues by Severity

🚫 Blocker Issues

None identified.

🔴 Critical Issues

None identified.

🟡 Major Issues

1. Missing SecurityContext on MinIO Deployment

The MinIO deployment lacks a securityContext.

Location: […]

Current state:

    spec:
      containers:
      - name: minio
        image: quay.io/minio/minio:latest
        # No securityContext

Recommended (following the pattern from public-api-deployment.yaml):

    containers:
    - name: minio
      image: quay.io/minio/minio:latest
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        capabilities:
          drop:
          - ALL
      # Note: readOnlyRootFilesystem cannot be used as MinIO needs write access to /data

Rationale: […]

Impact: While not a blocker for this specific bug fix, this should be addressed in a follow-up PR to align with platform security standards.

🔵 Minor Issues

1. Image Tag Uses latest

Location: image: quay.io/minio/minio:latest

Concern: Using […]

Recommendation: Pin to a specific MinIO version (e.g., […])

Impact: Low priority for this PR (focused on probe fix), but worth tracking.

Positive Highlights

✅ Excellent problem diagnosis - Root cause analysis clearly explains the EBS I/O + metadata scan issue
✅ Well-documented changes - The PR body includes a detailed comparison table showing before/after probe settings
✅ Production-validated - Fix was tested on […]
✅ Conservative approach - Timeouts increased incrementally (1s → 5s), not excessively
✅ Explicit configuration - Making […]
✅ Follows Kubernetes best practices - Probe timing follows the pattern: […]
✅ Balanced probe strategy - Liveness probe is less aggressive than readiness probe (standard pattern)

Recommendations

High Priority (Security)
Medium Priority (Stability)
Low Priority (Monitoring)

Code Quality Assessment

Final Verdict

LGTM with suggestions - This PR successfully resolves a critical production stability issue. The probe timing adjustments are well-reasoned and tested. The only concern is the missing SecurityContext, which should be addressed (either in this PR or a follow-up) to maintain security consistency across the platform.

Merge decision: Safe to merge as-is for immediate production stability. Create a follow-up issue to address SecurityContext hardening.

🤖 Review performed by Claude Code following CLAUDE.md security standards and Kubernetes best practices.

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system: […]

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

dupe #610

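Summary

- Liveness initialDelaySeconds from 30s to 60s, allowing time for EBS attach + MinIO metadata scan before probing begins
- failureThreshold: 3 made explicit on both probes for clarity

Probe Changes

| Probe     | Setting             | Before            | After        |
|-----------|---------------------|-------------------|--------------|
| Liveness  | initialDelaySeconds | 30                | 60           |
| Liveness  | timeoutSeconds      | 1                 | 5            |
| Liveness  | periodSeconds       | 10                | 15           |
| Readiness | timeoutSeconds      | 1                 | 5            |
| Readiness | periodSeconds       | 5                 | 10           |
| Both      | failureThreshold    | unset (default 3) | 3 (explicit) |

Root Cause
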
MinIO starts and binds ports successfully, but during startup I/O on the 50Gi EBS volume (metadata index scan of /data), the /minio/health/live endpoint can't respond within 1 second. After 3 consecutive failures, the kubelet kills the container (exit code 137). This produced 17+ restarts and CrashLoopBackOff on vteam-uat.

Test plan

- kustomize build components/manifests/overlays/production renders correctly
- Deploy to vteam-uat and confirm MinIO pod reaches 1/1 Ready with zero restarts
- oc describe pod -l app=minio -n ambient-code | grep -A5 Liveness

🤖 Generated with Claude Code