Skip to content

feat(operator): externalize session container resource limits to Conf…#604

Open
jeremyeder wants to merge 1 commit intoambient-code:mainfrom
jeremyeder:feature/externalize-session-resource-limits
Open

feat(operator): externalize session container resource limits to Conf…#604
jeremyeder wants to merge 1 commit intoambient-code:mainfrom
jeremyeder:feature/externalize-session-resource-limits

Conversation

@jeremyeder
Copy link
Collaborator

This has to get rolled to both clusters. It's basic hygiene.
I pulled a must-gather and did an analysis. This change fixes almost everything.

@Gkrumbach07:

Move all session container resource requests/limits to env vars loaded from the operator-config ConfigMap, following the existing pattern used by AMBIENT_CODE_RUNNER_IMAGE and STATE_SYNC_IMAGE. Add resources to the ambient-content container (previously had none). Add a LimitRange guardrail for session namespaces to enforce floor values on any new container.

wdyt

…igMap

AgenticSession pods were OOM-killed on vteam-stage (384Mi pod cgroup
limit, Java sidecar needs ~389MB). Resource values were hardcoded in
sessions.go making them impossible to tune without code changes.

Move all session container resource requests/limits to env vars loaded
from the operator-config ConfigMap, following the existing pattern used
by AMBIENT_CODE_RUNNER_IMAGE and STATE_SYNC_IMAGE. Add resources to the
ambient-content container (previously had none). Add a LimitRange
guardrail for session namespaces to enforce floor values on any new
container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 10, 2026

Claude Code Review

Summary

This PR externalizes session container resource limits from hardcoded values in Go code to ConfigMap-backed environment variables. This enables per-environment tuning without code changes. The implementation follows established patterns from the codebase and addresses a real production issue (OOM kills in vteam-stage).

Overall Assessment:APPROVE with minor suggestions

The core implementation is solid and follows project standards. There are a few areas for potential improvement around error handling and namespace isolation, but none are blockers.

Issues by Severity

🟡 Major Issues

1. LimitRange namespace isolation concern (sessions.go:1266-1270, sessions.go:2049-2092)

The ensureLimitRange function creates a LimitRange in the session namespace, but:

  • It's called for every session creation in the same namespace
  • It uses fixed hardcoded values that may not match the ConfigMap-driven values
  • The hardcoded limits in ensureLimitRange (sessions.go:2059-2073) don't use the ConfigMap values

Impact: LimitRange will have different values than the actual container resources if ConfigMap is customized.

Recommendation:

  • Pass the appConfig to ensureLimitRange and use ConfigMap values for LimitRange defaults
  • Or create LimitRange once per namespace (not per session) using a separate reconciliation loop
  • Or make LimitRange values configurable via environment variables like the container resources

Example fix:

func ensureLimitRange(namespace string, config *Config) error {
    // Use config values instead of hardcoded "500m", "2Gi", etc.
    lr := &corev1.LimitRange{
        // ... use config.RunnerCPURequest, config.RunnerMemoryRequest, etc.
    }
}

2. Error handling for resource parsing (sessions.go:856-862, sessions.go:1151-1157, sessions.go:1190-1196)

resource.MustParse() will panic if the ConfigMap contains invalid resource quantities. This violates the "Never Panic in Production Code" rule from CLAUDE.md:441.

Impact: Invalid ConfigMap values will crash the operator.

Recommendation: Validate resource strings at config load time or use non-panicking parsing:

// In config.go LoadConfig()
func validateResourceQuantity(key, value string) error {
    _, err := resource.ParseQuantity(value)
    if err != nil {
        return fmt.Errorf("invalid %s: %w", key, err)
    }
    return nil
}

// Or in sessions.go, use non-panicking parse
cpuRequest, err := resource.ParseQuantity(appConfig.RunnerCPURequest)
if err != nil {
    return fmt.Errorf("invalid RUNNER_CPU_REQUEST: %w", err)
}

🔵 Minor Issues

3. LimitRange warning is swallowed (sessions.go:1268)

The error is logged as a warning but doesn't prevent pod creation. This is fine, but consider if LimitRange creation failure should be a critical error since it's described as a "guardrail."

Current behavior:

if err := ensureLimitRange(sessionNamespace); err != nil {
    log.Printf("Warning: failed to ensure LimitRange in %s: %v", sessionNamespace, err)
}
// Pod creation continues even if LimitRange failed

Recommendation: Document why this is non-fatal, or consider making it fatal if LimitRanges are critical for cluster stability.

4. ConfigMap vs LimitRange mismatch in kind overlay (overlays/kind/operator-config.yaml:11-13)

The kind overlay reduces memory limits:

RUNNER_MEMORY_LIMIT: "1Gi"
CONTENT_MEMORY_LIMIT: "256Mi"  # Below LimitRange min of 1Gi!
STATE_SYNC_MEMORY_LIMIT: "256Mi"  # Below LimitRange min of 1Gi!

But the LimitRange (created by ensureLimitRange) has a minimum of 1Gi memory. This means these containers will be rejected by the API server in kind environments.

Impact: kind deployments may fail to create pods.

Recommendation:

  • Either adjust kind overlay to use at least 1Gi memory
  • Or make LimitRange minimums configurable via environment variables
  • Or skip LimitRange creation in kind environments

5. Missing documentation for new environment variables

The 12 new environment variables should be documented in:

  • components/operator/README.md (configuration section)
  • docs/configuration.md or equivalent
  • Deployment guides

Recommendation: Add a table documenting the new env vars, their defaults, and when to tune them.

Positive Highlights

Follows established patterns: Uses envOrDefault helper pattern consistently with existing AMBIENT_CODE_RUNNER_IMAGE approach

Addresses real production issue: OOM kills in vteam-stage (384Mi limit vs 389MB Java sidecar needs)

Good default values: Sensible defaults in operator-deployment.yaml that work for most environments

LimitRange guardrail: Adding LimitRange is a smart defense-in-depth approach to prevent misconfigured containers

Kind overlay tuning: Provides lower resource limits for local development, reducing resource pressure on developer machines

Clean code organization: Config fields grouped logically, clear naming convention

Backward compatible: Uses defaults if env vars not set, doesn't break existing deployments

Recommendations

Priority 1 (before merge):

  1. Fix resource.MustParse panic risk - validate at config load or use non-panicking parse
  2. Fix kind overlay memory limits to respect LimitRange minimums (or make LimitRange configurable)

Priority 2 (follow-up):
3. Make LimitRange values configurable via environment variables to match ConfigMap-driven container resources
4. Add documentation for new environment variables
5. Consider validation tests for invalid resource quantities

Priority 3 (nice-to-have):
6. Add unit tests for envOrDefault function
7. Consider logging the loaded resource values on operator startup for debugging


Files Reviewed:

  • ✅ components/operator/internal/config/config.go (new fields, envOrDefault helper)
  • ✅ components/operator/internal/handlers/sessions.go (resource application, LimitRange creation)
  • ✅ components/manifests/base/operator-deployment.yaml (env var defaults)
  • ✅ components/manifests/base/session-limitrange.yaml (LimitRange template)
  • ✅ components/manifests/overlays/kind/operator-config.yaml (kind overrides)

Standards Compliance:

  • ✅ Go formatting (follows gofmt patterns)
  • ✅ Error handling pattern (mostly - except MustParse)
  • ⚠️ No panic in production (violated by resource.MustParse - needs fix)
  • ✅ Configuration pattern (follows existing ConfigMap-backed env var approach)
  • ✅ Security context (no changes to existing security settings)

🤖 Generated by Claude Code Review


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant