
Conversation

@jeremyeder
Collaborator

Summary

  • Increase MinIO CPU/memory requests (250m/512Mi → 2000m/4Gi) and limits (1000m/2Gi → 4000m/8Gi); the resulting resources block is sketched below
  • Relax liveness and readiness probes with higher initialDelaySeconds, failureThreshold: 5, and timeoutSeconds: 5 to tolerate slow MinIO startups
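
For reference, a minimal sketch of the resulting resources block in components/manifests/base/minio-deployment.yaml (field layout assumed from a standard container spec; only the values above come from this PR):

```yaml
resources:
  requests:
    cpu: 2000m      # was 250m
    memory: 4Gi     # was 512Mi
  limits:
    cpu: 4000m      # was 1000m
    memory: 8Gi     # was 2Gi
```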

Test plan

  • Deploy to a cluster and verify the MinIO pod starts without probe-triggered restarts
  • Confirm the MinIO service becomes ready and accepts connections

🤖 Generated with Claude Code

Increase CPU/memory requests and limits to handle heavier workloads,
and make liveness/readiness probes more tolerant of slow startups by
adding timeoutSeconds, failureThreshold, and increasing delays.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions bot commented Feb 10, 2026

Claude Code Review

Summary

PR #610 increases MinIO resource allocation and relaxes health probe settings to prevent probe-triggered restarts during slow startups. The changes are infrastructure-focused (Kubernetes deployment configuration) and do not involve code logic, authentication, or security patterns.

Issues by Severity

🚫 Blocker Issues

None found.

🔴 Critical Issues

None found.

🟡 Major Issues

1. Resource Allocation Increase - Justification Needed

  • Location: components/manifests/base/minio-deployment.yaml:76-81
  • Issue: CPU and memory increases are substantial:
    • CPU requests: 250m → 2000m (8x increase)
    • CPU limits: 1000m → 4000m (4x increase)
    • Memory requests: 512Mi → 4Gi (~8x increase)
    • Memory limits: 2Gi → 8Gi (4x increase)
  • Impact: This significantly increases the CPU and memory reserved (requests) and allowed (limits) per MinIO deployment, which affects scheduling and cluster capacity planning.
  • Recommendation: Document the reasoning for these specific values. Were these based on observed resource usage patterns, load testing, or vendor recommendations? Consider whether these values are appropriate for all deployment scenarios (dev, staging, prod).

2. Probe Timing Changes - Risk of Delayed Failure Detection

  • Location: components/manifests/base/minio-deployment.yaml:62-73
  • Issue:
    • Liveness probe: initialDelaySeconds: 30 → 60, periodSeconds: 10 → 15
    • Readiness probe: initialDelaySeconds: 10 → 20, periodSeconds: 5 → 10
    • Both probes: Added failureThreshold: 5
  • Impact: Longer detection times for failures:
    • Liveness failure detection: 30s + 3×10s = 60s (old) → 60s + 5×15s = 135s (new)
    • Readiness failure detection: 10s + 3×5s = 25s (old) → 20s + 5×10s = 70s (new)
  • Recommendation: While these changes prevent false positives during slow startups, they also increase the time to detect legitimate failures. Consider whether failureThreshold: 5 is too high—could failureThreshold: 3 be sufficient?
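
To make the detection math above concrete, here is a hedged sketch of the new liveness probe with the worst-case timing annotated. The httpGet path and port are assumptions (MinIO commonly exposes /minio/health/live on port 9000); only the timing values come from this PR:

```yaml
livenessProbe:
  httpGet:
    path: /minio/health/live   # assumed endpoint, not shown in the diff
    port: 9000                 # assumed port
  initialDelaySeconds: 60      # was 30
  periodSeconds: 15            # was 10
  timeoutSeconds: 5            # newly added
  failureThreshold: 5          # newly added (Kubernetes default is 3)
# Worst case from pod start to restart of a hung container:
#   initialDelaySeconds + failureThreshold × periodSeconds
#   = 60s + 5 × 15s = 135s   (previously 30s + 3 × 10s = 60s)
# The readiness probe follows the same arithmetic: 20s + 5 × 10s = 70s.
```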

🔵 Minor Issues

1. Missing Documentation Update

  • Issue: No documentation explaining when these higher resource values are appropriate.
  • Recommendation: Add a comment in the YAML file or update components/manifests/README.md with guidance on:
    • When to use these resource values vs. lower values
    • How to adjust for different deployment sizes
    • Observed MinIO resource usage patterns that motivated these changes

2. Environment-Specific Configuration Consideration

  • Issue: These values are in the base deployment, so they'll apply to all environments.
  • Recommendation: Consider using Kustomize overlays to allow different resource values for dev/staging/prod environments. Not blocking for this PR, but worth considering for future work.
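
A minimal sketch of what such an overlay could look like, assuming the Deployment and container are both named minio (the overlay path and the dev-sized values are illustrative, not part of this PR):

```yaml
# overlays/dev/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: minio            # assumed Deployment name
      spec:
        template:
          spec:
            containers:
              - name: minio    # assumed container name
                resources:
                  requests: {cpu: 250m, memory: 512Mi}
                  limits: {cpu: 1000m, memory: 2Gi}
```

A prod overlay would do the same with the higher values from this PR.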

Positive Highlights

Appropriate Response to Operational Issues

  • Addresses real-world startup issues with probe-triggered restarts
  • Follows Kubernetes best practices by tuning probes rather than disabling them

Conservative Probe Tuning

  • Added timeoutSeconds: 5 to both probes, giving slow health-check responses headroom before they count as failures
  • Used reasonable failureThreshold values to tolerate transient failures

Clear PR Description

  • Well-documented changes with specific before/after values
  • Includes test plan (though not yet executed per checklist)

Follows Git Workflow Standards

  • Feature branch name follows convention
  • Commit message follows conventional commits style
  • PR title uses proper fix: prefix

Recommendations

Priority 1 (Address Before Merge):

  1. Execute Test Plan: Check the box in the PR description after verifying:

    • MinIO pod starts without probe-triggered restarts
    • MinIO service becomes ready and accepts connections
    • Confirm the pod actually uses these resources (check metrics)
  2. Justify Resource Values: Add a comment explaining:

    • What workload/scenario requires 2000m CPU and 4Gi memory?
    • Were these values based on production metrics or load testing?
    • Example comment to add above resources section:
    # Resource values tuned for [describe scenario]:
    # - Based on observed peak usage of [X] during [Y]
    # - CPU: 2000m request handles [Z] concurrent operations
    # - Memory: 4Gi handles [N] buckets with [M] objects

Priority 2 (Consider for Follow-up):

  1. Environment-Specific Values: Create Kustomize overlays:

    • overlays/dev/: Lower resource values for local development
    • overlays/prod/: Current higher values for production
  2. Monitoring: Add Prometheus metrics/alerts (a starting-point rule is sketched after this list) for:

    • MinIO CPU/memory usage vs. limits
    • Probe failure rates
    • Startup duration
  3. Documentation: Update docs/deployment/ with MinIO sizing guidance.
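
As a starting point for the monitoring item above, a hedged sketch of a PrometheusRule that fires when MinIO memory usage approaches its limit. It assumes the Prometheus Operator CRDs are installed and that cAdvisor and kube-state-metrics are scraped; the container selector is a guess at the deployment's naming:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: minio-resource-alerts        # hypothetical name
spec:
  groups:
    - name: minio.resources
      rules:
        - alert: MinioMemoryNearLimit
          expr: |
            max by (namespace, pod) (container_memory_working_set_bytes{container="minio"})
              /
            max by (namespace, pod) (kube_pod_container_resource_limits{container="minio", resource="memory"})
              > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: MinIO memory usage has been above 90% of its limit for 10 minutes
```

CPU and restart-count rules (e.g. kube_pod_container_status_restarts_total as a proxy for liveness-probe failures) would follow the same pattern.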


Overall Assessment: This PR appropriately addresses operational issues with MinIO startup reliability. The changes are infrastructure-focused and don't raise security or code quality concerns. The main recommendation is to document the justification for these specific resource values and complete the test plan before merging.

Suggested Action: Execute the test plan, add a comment documenting the rationale for resource values, then merge.


🔍 View AI decision process (logs available for 90 days)

📋 Memory system files loaded

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@jeremyeder jeremyeder merged commit e9f3d45 into ambient-code:main Feb 10, 2026
13 checks passed