feat(operator): externalize session container resource limits to Conf… by jeremyeder · Pull Request #604 · ambient-code/platform

jeremyeder · 2026-02-10T01:53:57Z

This has to get rolled to both clusters. It's basic hygiene.
I pulled a must-gather and did an analysis. This change fixes almost everything.

@Gkrumbach07:

Move all session container resource requests/limits to env vars loaded from the operator-config ConfigMap, following the existing pattern used by AMBIENT_CODE_RUNNER_IMAGE and STATE_SYNC_IMAGE. Add resources to the ambient-content container (previously had none). Add a LimitRange guardrail for session namespaces to enforce floor values on any new container.

wdyt

…igMap AgenticSession pods were OOM-killed on vteam-stage (384Mi pod cgroup limit, Java sidecar needs ~389MB). Resource values were hardcoded in sessions.go making them impossible to tune without code changes. Move all session container resource requests/limits to env vars loaded from the operator-config ConfigMap, following the existing pattern used by AMBIENT_CODE_RUNNER_IMAGE and STATE_SYNC_IMAGE. Add resources to the ambient-content container (previously had none). Add a LimitRange guardrail for session namespaces to enforce floor values on any new container. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-02-10T01:55:34Z

Claude Code Review

Summary

This PR externalizes session container resource limits from hardcoded values in Go code to ConfigMap-backed environment variables. This enables per-environment tuning without code changes. The implementation follows established patterns from the codebase and addresses a real production issue (OOM kills in vteam-stage).

Overall Assessment: ✅ APPROVE with minor suggestions

The core implementation is solid and follows project standards. There are a few areas for potential improvement around error handling and namespace isolation, but none are blockers.

Issues by Severity

🟡 Major Issues

1. LimitRange namespace isolation concern (sessions.go:1266-1270, sessions.go:2049-2092)

The ensureLimitRange function creates a LimitRange in the session namespace, but:

It's called for every session creation in the same namespace
It uses fixed hardcoded values that may not match the ConfigMap-driven values
The hardcoded limits in ensureLimitRange (sessions.go:2059-2073) don't use the ConfigMap values

Impact: LimitRange will have different values than the actual container resources if ConfigMap is customized.

Recommendation:

Pass the appConfig to ensureLimitRange and use ConfigMap values for LimitRange defaults
Or create LimitRange once per namespace (not per session) using a separate reconciliation loop
Or make LimitRange values configurable via environment variables like the container resources

Example fix:

func ensureLimitRange(namespace string, config *Config) error {
    // Use config values instead of hardcoded "500m", "2Gi", etc.
    lr := &corev1.LimitRange{
        // ... use config.RunnerCPURequest, config.RunnerMemoryRequest, etc.
    }
}

2. Error handling for resource parsing (sessions.go:856-862, sessions.go:1151-1157, sessions.go:1190-1196)

resource.MustParse() will panic if the ConfigMap contains invalid resource quantities. This violates the "Never Panic in Production Code" rule from CLAUDE.md:441.

Impact: Invalid ConfigMap values will crash the operator.

Recommendation: Validate resource strings at config load time or use non-panicking parsing:

// In config.go LoadConfig()
func validateResourceQuantity(key, value string) error {
    _, err := resource.ParseQuantity(value)
    if err != nil {
        return fmt.Errorf("invalid %s: %w", key, err)
    }
    return nil
}

// Or in sessions.go, use non-panicking parse
cpuRequest, err := resource.ParseQuantity(appConfig.RunnerCPURequest)
if err != nil {
    return fmt.Errorf("invalid RUNNER_CPU_REQUEST: %w", err)
}

🔵 Minor Issues

3. LimitRange warning is swallowed (sessions.go:1268)

The error is logged as a warning but doesn't prevent pod creation. This is fine, but consider if LimitRange creation failure should be a critical error since it's described as a "guardrail."

Current behavior:

if err := ensureLimitRange(sessionNamespace); err != nil {
    log.Printf("Warning: failed to ensure LimitRange in %s: %v", sessionNamespace, err)
}
// Pod creation continues even if LimitRange failed

Recommendation: Document why this is non-fatal, or consider making it fatal if LimitRanges are critical for cluster stability.

4. ConfigMap vs LimitRange mismatch in kind overlay (overlays/kind/operator-config.yaml:11-13)

The kind overlay reduces memory limits:

RUNNER_MEMORY_LIMIT: "1Gi"
CONTENT_MEMORY_LIMIT: "256Mi"  # Below LimitRange min of 1Gi!
STATE_SYNC_MEMORY_LIMIT: "256Mi"  # Below LimitRange min of 1Gi!

But the LimitRange (created by ensureLimitRange) has a minimum of 1Gi memory. This means these containers will be rejected by the API server in kind environments.

Impact: kind deployments may fail to create pods.

Recommendation:

Either adjust kind overlay to use at least 1Gi memory
Or make LimitRange minimums configurable via environment variables
Or skip LimitRange creation in kind environments

5. Missing documentation for new environment variables

The 12 new environment variables should be documented in:

components/operator/README.md (configuration section)
docs/configuration.md or equivalent
Deployment guides

Recommendation: Add a table documenting the new env vars, their defaults, and when to tune them.

Positive Highlights

✅ Follows established patterns: Uses envOrDefault helper pattern consistently with existing AMBIENT_CODE_RUNNER_IMAGE approach

✅ Addresses real production issue: OOM kills in vteam-stage (384Mi limit vs 389MB Java sidecar needs)

✅ Good default values: Sensible defaults in operator-deployment.yaml that work for most environments

✅ LimitRange guardrail: Adding LimitRange is a smart defense-in-depth approach to prevent misconfigured containers

✅ Kind overlay tuning: Provides lower resource limits for local development, reducing resource pressure on developer machines

✅ Clean code organization: Config fields grouped logically, clear naming convention

✅ Backward compatible: Uses defaults if env vars not set, doesn't break existing deployments

Recommendations

Priority 1 (before merge):

Fix resource.MustParse panic risk - validate at config load or use non-panicking parse
Fix kind overlay memory limits to respect LimitRange minimums (or make LimitRange configurable)

Priority 2 (follow-up):
3. Make LimitRange values configurable via environment variables to match ConfigMap-driven container resources
4. Add documentation for new environment variables
5. Consider validation tests for invalid resource quantities

Priority 3 (nice-to-have):
6. Add unit tests for envOrDefault function
7. Consider logging the loaded resource values on operator startup for debugging

Files Reviewed:

✅ components/operator/internal/config/config.go (new fields, envOrDefault helper)
✅ components/operator/internal/handlers/sessions.go (resource application, LimitRange creation)
✅ components/manifests/base/operator-deployment.yaml (env var defaults)
✅ components/manifests/base/session-limitrange.yaml (LimitRange template)
✅ components/manifests/overlays/kind/operator-config.yaml (kind overrides)

Standards Compliance:

✅ Go formatting (follows gofmt patterns)
✅ Error handling pattern (mostly - except MustParse)
⚠️ No panic in production (violated by resource.MustParse - needs fix)
✅ Configuration pattern (follows existing ConfigMap-backed env var approach)
✅ Security context (no changes to existing security settings)

🤖 Generated by Claude Code Review

🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

CLAUDE.md - Master project instructions, development standards
backend-development.md - Go backend, K8s integration patterns
frontend-development.md - NextJS, Shadcn UI, React Query patterns
security-standards.md - Auth, RBAC, token handling
k8s-client-usage.md - User token vs service account patterns
error-handling.md - Consistent error patterns
react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(operator): externalize session container resource limits to Conf…#604

feat(operator): externalize session container resource limits to Conf…#604
jeremyeder wants to merge 1 commit intoambient-code:mainfrom
jeremyeder:feature/externalize-session-resource-limits

jeremyeder commented Feb 10, 2026

Uh oh!

github-actions bot commented Feb 10, 2026 •

edited

Loading

What Amber Loaded for Code Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeremyeder commented Feb 10, 2026

Uh oh!

github-actions bot commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Summary

Issues by Severity

🟡 Major Issues

🔵 Minor Issues

Positive Highlights

Recommendations

What Amber Loaded for Code Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions bot commented Feb 10, 2026 •

edited

Loading