feat(operator): externalize session container resource limits to ConfigMap #604

jeremyeder wants to merge 1 commit into ambient-code:main
Conversation
AgenticSession pods were OOM-killed on vteam-stage (384Mi pod cgroup limit; the Java sidecar needs ~389MB). Resource values were hardcoded in sessions.go, making them impossible to tune without code changes.

Move all session container resource requests/limits to env vars loaded from the operator-config ConfigMap, following the existing pattern used by AMBIENT_CODE_RUNNER_IMAGE and STATE_SYNC_IMAGE. Add resources to the ambient-content container (previously it had none). Add a LimitRange guardrail for session namespaces to enforce floor values on any new container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
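For reference, a minimal sketch of the env-var pattern the description refers to (the getEnvOrDefault helper, the RUNNER_MEMORY_REQUEST key, and the default values are illustrative assumptions, not the PR's actual code):

```go
package operator

import "os"

// Config holds session container resources, now sourced from the
// operator-config ConfigMap via env vars (subset shown).
type Config struct {
	RunnerCPURequest    string
	RunnerMemoryRequest string
	RunnerMemoryLimit   string
}

// getEnvOrDefault is a hypothetical helper mirroring how
// AMBIENT_CODE_RUNNER_IMAGE and STATE_SYNC_IMAGE are already loaded.
func getEnvOrDefault(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

func LoadConfig() *Config {
	return &Config{
		// Defaults here are illustrative, not the PR's actual values.
		RunnerCPURequest:    getEnvOrDefault("RUNNER_CPU_REQUEST", "500m"),
		RunnerMemoryRequest: getEnvOrDefault("RUNNER_MEMORY_REQUEST", "1Gi"),
		RunnerMemoryLimit:   getEnvOrDefault("RUNNER_MEMORY_LIMIT", "2Gi"),
	}
}
```

Because every key falls back to a default, deployments that never set the env vars keep their current behavior.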
Claude Code Review

Summary: This PR externalizes session container resource limits from hardcoded values in Go code to ConfigMap-backed environment variables, enabling per-environment tuning without code changes. The implementation follows established patterns from the codebase and addresses a real production issue (OOM kills in vteam-stage).

Overall Assessment: ✅ APPROVE with minor suggestions

The core implementation is solid and follows project standards. There are a few areas for potential improvement around error handling and namespace isolation, but none are blockers.

Issues by Severity

🟡 Major Issues

1. LimitRange namespace isolation concern (sessions.go:1266-1270, sessions.go:2049-2092)

The LimitRange is created with hardcoded values ("500m", "2Gi", etc.) in each session namespace, while the container resources now come from the ConfigMap.

Impact: The LimitRange will have different values than the actual container resources if the ConfigMap is customized.

Recommendation: Derive the LimitRange values from the same ConfigMap-backed config used for the container resources.
Example fix:

```go
func ensureLimitRange(namespace string, config *Config) error {
	// Use config values instead of hardcoded "500m", "2Gi", etc.
	lr := &corev1.LimitRange{
		// ... build Spec from config.RunnerCPURequest, config.RunnerMemoryRequest, etc.
	}
	// ... create or update lr, then:
	return nil
}
```

2. Error handling for resource parsing (sessions.go:856-862, sessions.go:1151-1157, sessions.go:1190-1196)

The resource strings are parsed with a panicking parser (resource.MustParse) at pod-creation time.
Impact: Invalid ConfigMap values will crash the operator.

Recommendation: Validate resource strings at config load time, or use non-panicking parsing:

```go
// In config.go LoadConfig()
func validateResourceQuantity(key, value string) error {
	_, err := resource.ParseQuantity(value)
	if err != nil {
		return fmt.Errorf("invalid %s: %w", key, err)
	}
	return nil
}
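
// Hypothetical wiring (a sketch, assuming LoadConfig returns
// (*Config, error) once validation is added): call the validator for
// each resource env var so bad values fail fast at startup instead of
// panicking during reconciliation.
for key, value := range map[string]string{
	"RUNNER_CPU_REQUEST":  cfg.RunnerCPURequest,
	"RUNNER_MEMORY_LIMIT": cfg.RunnerMemoryLimit,
} {
	if err := validateResourceQuantity(key, value); err != nil {
		return nil, err
	}
}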
// Or in sessions.go, use non-panicking parse
cpuRequest, err := resource.ParseQuantity(appConfig.RunnerCPURequest)
if err != nil {
return fmt.Errorf("invalid RUNNER_CPU_REQUEST: %w", err)
}
```

🔵 Minor Issues

3. LimitRange warning is swallowed (sessions.go:1268)

The error is logged as a warning but doesn't prevent pod creation. This is fine, but consider whether LimitRange creation failure should be a critical error, since the LimitRange is described as a "guardrail."

Current behavior:

```go
if err := ensureLimitRange(sessionNamespace); err != nil {
log.Printf("Warning: failed to ensure LimitRange in %s: %v", sessionNamespace, err)
}
// Pod creation continues even if LimitRange failed
```

Recommendation: Document why this is non-fatal, or consider making it fatal if LimitRanges are critical for cluster stability.

4. ConfigMap vs LimitRange mismatch in kind overlay (overlays/kind/operator-config.yaml:11-13)

The kind overlay reduces memory limits:

```yaml
RUNNER_MEMORY_LIMIT: "1Gi"
CONTENT_MEMORY_LIMIT: "256Mi" # Below LimitRange min of 1Gi!
STATE_SYNC_MEMORY_LIMIT: "256Mi" # Below LimitRange min of 1Gi!
```

But the LimitRange (created by ensureLimitRange) still enforces hardcoded floor values, so these overrides violate its 1Gi minimum.

Impact: kind deployments may fail to create pods.

Recommendation: Lower the LimitRange floor for kind, or derive the LimitRange values from the same ConfigMap-backed config (see issue 1 and the sketch below).
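Combining the recommendations from issues 1 and 4, a minimal sketch of a config-driven LimitRange (the function name, object name, and error handling here are assumptions; the Config fields follow the issue 1 example):

```go
package operator

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildLimitRange derives the guardrail from the same ConfigMap-backed
// config as the container resources, so the two cannot drift apart.
func buildLimitRange(namespace string, config *Config) (*corev1.LimitRange, error) {
	cpuMin, err := resource.ParseQuantity(config.RunnerCPURequest)
	if err != nil {
		return nil, fmt.Errorf("invalid RUNNER_CPU_REQUEST: %w", err)
	}
	memMin, err := resource.ParseQuantity(config.RunnerMemoryRequest)
	if err != nil {
		return nil, fmt.Errorf("invalid RUNNER_MEMORY_REQUEST: %w", err)
	}
	return &corev1.LimitRange{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "ambient-session-limits", // hypothetical name
			Namespace: namespace,
		},
		Spec: corev1.LimitRangeSpec{
			Limits: []corev1.LimitRangeItem{{
				Type: corev1.LimitTypeContainer,
				// Floor values for any new container in the namespace.
				Min: corev1.ResourceList{
					corev1.ResourceCPU:    cpuMin,
					corev1.ResourceMemory: memMin,
				},
			}},
		},
	}, nil
}
```

Deriving the floor from the same parsed quantities means an overlay that lowers the ConfigMap values automatically lowers the guardrail too, avoiding the kind mismatch.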
5. Missing documentation for new environment variables

The 12 new environment variables should be documented in the project docs.
Recommendation: Add a table documenting the new env vars, their defaults, and when to tune them.

Positive Highlights

✅ Follows established patterns: Uses the same ConfigMap-backed env var mechanism as AMBIENT_CODE_RUNNER_IMAGE and STATE_SYNC_IMAGE

✅ Addresses real production issue: OOM kills in vteam-stage (384Mi limit vs ~389MB Java sidecar needs)

✅ Good default values: Sensible defaults in operator-deployment.yaml that work for most environments

✅ LimitRange guardrail: Adding a LimitRange is a smart defense-in-depth approach to prevent misconfigured containers

✅ Kind overlay tuning: Provides lower resource limits for local development, reducing resource pressure on developer machines

✅ Clean code organization: Config fields grouped logically, clear naming convention

✅ Backward compatible: Uses defaults if env vars are not set; doesn't break existing deployments

Recommendations

Priority 1 (before merge):
Priority 2 (follow-up):

Priority 3 (nice-to-have):

Files Reviewed:
Standards Compliance:
🤖 Generated by Claude Code Review

What Amber Loaded for Code Review: Amber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360), not just generic code review guidelines.
This has to get rolled out to both clusters. It's basic hygiene.
I pulled a must-gather and did an analysis. This change fixes almost everything.
@Gkrumbach07:
> Move all session container resource requests/limits to env vars loaded from the operator-config ConfigMap, following the existing pattern used by AMBIENT_CODE_RUNNER_IMAGE and STATE_SYNC_IMAGE. Add resources to the ambient-content container (previously had none). Add a LimitRange guardrail for session namespaces to enforce floor values on any new container.
wdyt