@jeremyeder jeremyeder commented Feb 11, 2026

Summary

make kind-up on Apple Silicon crashed because the kind overlay pointed to amd64-only images. Fixed by switching to multi-arch images and adding a pre-flight architecture check.

  • Primary fix: Kind overlay (overlays/kind/operator-env-patch.yaml) pointed to quay.io/gkrumbach07/ (amd64-only) instead of quay.io/ambient_code/ (multi-arch). The state-sync init container's rclone binary (Go) hit a known Go runtime bug (lfstack.push invalid packing) under QEMU emulation on arm64.
  • E2E gap: State-sync was missing from the E2E workflow — not built, not loaded into kind.
  • Makefile gap: _build-and-load target (minikube path) omitted state-sync.
  • Prevention: New check-image-arch Make target runs as a kind-up prerequisite. On arm64 hosts, fails fast if the overlay references images outside quay.io/ambient_code/.
  • Docs: Streamlined local dev README (249 → 65 lines), added bottom-line summaries to README and kind guide, added arm64 troubleshooting entry.
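The pre-flight check described in the Prevention bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual recipe: the function name `check_image_arch` and its argument order are hypothetical, and only the `grep` filter mirrors the line quoted later in the review. It assumes the overlay stores image references in `value:` fields.

```shell
#!/bin/sh
# Hypothetical stand-in for the check-image-arch target (not the PR's recipe).
# Usage: check_image_arch <overlay-file> [arch]
check_image_arch() {
    overlay=$1
    arch=${2:-$(uname -m)}   # typically arm64 on macOS, aarch64 on Linux

    # Only arm64 hosts need the check; multi-arch images also cover amd64.
    case "$arch" in
        arm64|aarch64) ;;
        *) return 0 ;;
    esac

    # Collect quay.io image values that are not under the multi-arch registry.
    # A missing overlay file yields an empty result and therefore passes.
    bad=$(grep -E 'value:.*quay\.io/' "$overlay" 2>/dev/null \
        | grep -v 'quay\.io/ambient_code/' || true)

    if [ -n "$bad" ]; then
        echo "ERROR: $overlay references images outside quay.io/ambient_code/:" >&2
        echo "$bad" >&2
        return 1
    fi
    return 0
}
```

Passing the architecture as an optional second argument makes the check easy to exercise on an amd64 machine, for example `check_image_arch overlays/kind/operator-env-patch.yaml arm64`.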

Validation

Tested on Apple Silicon (arm64) Mac with existing kind cluster:

  1. Patched running operator to use quay.io/ambient_code/vteam_state_sync:latest
  2. Created AgenticSession in test namespace
  3. init-hydrate completed with exit code 0 — rclone connected to S3, repo cloned successfully
  4. Verified make check-image-arch passes with fixed overlay and fails with old gkrumbach07 registry
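Independently of the registry-name check, one way to confirm that an image actually ships an arm64 variant is to look at its manifest. The helper below is a small sketch with a hypothetical name (`manifest_has_arch`); it only pattern-matches JSON text on stdin and assumes the manifest lists platforms with an `"architecture"` field, as `docker manifest inspect` prints for multi-arch images.

```shell
# Reads an image manifest (JSON) on stdin; succeeds if it lists the given
# architecture. Plain text matching, so no jq dependency is needed.
manifest_has_arch() {
    grep -Eq "\"architecture\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Example (needs docker and network access, so it is left commented out):
# docker manifest inspect quay.io/ambient_code/vteam_state_sync:latest \
#     | manifest_has_arch arm64 && echo "arm64 variant available"
```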

Test plan

  • make kind-up on Apple Silicon Mac — all pods start, state-sync pulls from quay.io/ambient_code/
  • Create AgenticSession — init-hydrate completes without crash
  • make check-image-arch passes on arm64
  • Simulate bad registry in overlay — make check-image-arch blocks with clear error
  • E2E CI picks up state-sync image (visible in workflow logs)

🤖 Generated with Claude Code

@jeremyeder jeremyeder force-pushed the bugfix/state-sync-arm64-compat branch from b8c16f9 to a1b81fe Compare February 11, 2026 20:18
`make kind-up` on Apple Silicon crashed because the kind overlay pointed
to amd64-only images at quay.io/gkrumbach07/. Fixed by switching to the
multi-arch images at quay.io/ambient_code/ and adding a pre-flight
architecture check so this class of bug is caught before deployment.

Root cause: The state-sync init container (rclone, a Go binary) hit a
known Go runtime bug (lfstack.push invalid packing) when an amd64 binary
ran under QEMU emulation on arm64. The CI build pipeline already produces
multi-arch images at quay.io/ambient_code/ — the kind overlay just wasn't
using them.

Changes:
- Fix kind overlay to use quay.io/ambient_code/ registry (primary fix)
- Add state-sync to E2E workflow (build/pull + kind load)
- Add state-sync to Makefile _build-and-load target (minikube path)
- Add check-image-arch Make target as kind-up prerequisite — on arm64
  hosts, fails fast if overlay references non-multi-arch images
- Streamline local dev docs: add bottom-line summary, remove duplicate
  Kind section, add arm64 troubleshooting entry

Validated: created AgenticSession on arm64 kind cluster, init-hydrate
completed successfully (exit 0, rclone connected to S3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jeremyeder jeremyeder force-pushed the bugfix/state-sync-arm64-compat branch from a1b81fe to e3dc83e Compare February 11, 2026 20:18

github-actions bot commented Feb 11, 2026

Claude Code Review

Summary

This PR fixes a critical compatibility issue for Apple Silicon developers by updating Kind overlay images from amd64-only to multi-arch images. The changes include CI improvements, preventative checks, and documentation streamlining. The approach is methodical and well-executed.

Issues by Severity

🚫 Blocker Issues

None - PR is ready to merge.

🔴 Critical Issues

None identified.

🟡 Major Issues

None identified.

🔵 Minor Issues

1. E2E Workflow - State-sync build condition could be more accurate

# Line 162-163
if [ "${{ needs.detect-changes.outputs.claude-runner }}" == "true" ]; then
  echo "Building state-sync (changed)..."

Issue: State-sync is built when the claude-runner changes, but ideally it should check for state-sync directory changes specifically. This means state-sync might not rebuild when only components/runners/state-sync/ changes (if that directory exists).

Recommendation: If state-sync has its own directory under components/runners/, consider adding a dedicated change detection output in the workflow. If it's part of the same build context as claude-runner, this is acceptable as-is.

Impact: Low - The fallback to pulling latest means this won't break builds, just might miss cache optimization opportunities.
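A dedicated change-detection step could be as simple as testing whether any changed path falls under the state-sync directory. The sketch below assumes that directory is components/runners/state-sync/ (the review itself is unsure it exists); the function name `changed_under` and the `origin/main` base ref in the usage comment are illustrative, not taken from the workflow.

```shell
# Reads changed file paths on stdin (one per line); prints "true" if any path
# lies under the given directory prefix, otherwise "false".
changed_under() {
    prefix=${1%/}/   # normalise to exactly one trailing slash
    while IFS= read -r path || [ -n "$path" ]; do
        case "$path" in
            "$prefix"*) echo "true"; return 0 ;;
        esac
    done
    echo "false"
}

# Example (base ref is an assumption):
# git diff --name-only origin/main...HEAD | changed_under components/runners/state-sync
```

The printed value could then be written to a workflow output alongside the existing claude-runner one.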


2. Documentation - Missing troubleshooting step verification

# Should show: quay.io/ambient_code/vteam_state_sync:latest
# If it shows gkrumbach07 or another registry, update the kind overlay and redeploy

Issue: The troubleshooting guide tells users to "update the kind overlay and redeploy" but doesn't provide the exact commands.

Recommendation: Add the specific fix commands:

# Fix: Edit components/manifests/overlays/kind/operator-env-patch.yaml
# Change STATE_SYNC_IMAGE to quay.io/ambient_code/vteam_state_sync:latest
# Then redeploy:
kubectl delete deployment agentic-operator -n ambient-code
make kind-deploy

Impact: Low - Developers can figure this out, but explicit commands improve UX.
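Before editing the overlay, it can help to print what it currently sets. The following is a sketch under a stated assumption: the env patch uses the usual layout where a `value:` line directly follows its `- name:` line. The helper name `overlay_env_value` is hypothetical.

```shell
# Prints the value configured for an env var in a Kubernetes env patch.
# Assumes "value:" appears on the line immediately after "- name: <VAR>".
overlay_env_value() {
    awk -v name="$2" '
        $1 == "-" && $2 == "name:" && $3 == name { found = 1; next }
        found && $1 == "value:" { print $2; exit }
        { found = 0 }
    ' "$1"
}

# Example:
# overlay_env_value components/manifests/overlays/kind/operator-env-patch.yaml STATE_SYNC_IMAGE
```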


3. Makefile - check-image-arch could validate all environment patches

BAD=$(grep -E 'value:.*quay\.io/' "$$OVERLAY" 2>/dev/null | grep -v 'quay\.io/ambient_code/' || true);

Issue: The check only validates the kind overlay, but similar issues could occur in minikube or other overlays.

Recommendation: Consider extending the check to all overlays or document that this is kind-specific:

# Option 1: Check all overlays
OVERLAYS=$$(find components/manifests/overlays -name "*-env-patch.yaml");
for OVERLAY in $$OVERLAYS; do ... done

# Option 2: Add comment explaining scope
check-image-arch: ## Verify kind overlay images are multi-arch compatible with host

Impact: Low - Kind is the recommended approach, so this is sufficient for now.
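Option 1 could look roughly like the plain-shell sketch below. The function name `check_all_overlays` is hypothetical; the file pattern and the `grep` filters are the ones quoted above. Inside a Makefile recipe every `$` would need to be doubled, and `grep -H` is assumed to be available (it is in both GNU and BSD grep).

```shell
# Scans every *-env-patch.yaml under a directory for quay.io image values
# outside quay.io/ambient_code/. Prints offending lines (prefixed with the
# file name) to stderr and returns 1 if any are found.
check_all_overlays() {
    bad=$(find "$1" -name '*-env-patch.yaml' \
            -exec grep -HE 'value:.*quay\.io/' {} + \
        | grep -v 'quay\.io/ambient_code/' || true)

    if [ -n "$bad" ]; then
        echo "ERROR: overlays reference images outside quay.io/ambient_code/:" >&2
        echo "$bad" >&2
        return 1
    fi
    return 0
}

# Example: check_all_overlays components/manifests/overlays
```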

Positive Highlights

Excellent preventative engineering - The check-image-arch Make target will catch this class of errors before they manifest. This is exactly the right approach: fail fast with clear error messages.

CI/E2E coverage improvement - Adding state-sync to the E2E workflow closes a gap that could have caused mysterious CI failures. Good defensive practice.

Documentation streamlining - Reducing README from 249 to 65 lines dramatically improves discoverability. The "bottom line" summary at the top is excellent UX. This follows the CLAUDE.md principle: "Default to improving existing documentation rather than creating new files."

Architecture-aware fix - The root cause analysis correctly identifies the QEMU emulation issue with amd64-only Go binaries on arm64. The fix (switch to multi-arch images) is the right architectural solution rather than a workaround.

Testing rigor - The validation section shows real testing on Apple Silicon hardware with actual AgenticSession creation and init container verification. This demonstrates thorough validation.

Follows project patterns - The Makefile changes follow established conventions (COLOR_* variables, check-* target naming, target prerequisites). The grep pattern for registry detection is appropriate for the use case.

Recommendations

Priority 1 (Pre-merge):

  • None - PR is ready to merge as-is.

Priority 2 (Follow-up, optional):

  1. Consider adding explicit redeploy commands to the kind.md troubleshooting section (see Minor Issue #2).
  2. Validate that the state-sync image rebuilds correctly when only components/runners/state-sync/ changes in CI (see Minor Issue #1).
  3. Document that check-image-arch is kind-specific or extend it to other overlays if needed (see Minor Issue #3).

Priority 3 (Future enhancement):

  • Consider adding a similar check for the E2E workflow itself to ensure it references quay.io/ambient_code/ images.

Overall Assessment: Approve and Merge

This PR demonstrates excellent engineering practices:

  • Root cause analysis leading to architectural fix
  • Preventative checks to avoid regression
  • CI gap closure
  • Documentation improvement
  • Thorough testing validation

The minor issues are truly minor and don't block merging. The core fix is sound, well-tested, and follows project standards.

🤖 Reviewed by Claude Sonnet 4.5 via Claude Code


What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.


github-actions bot commented Feb 11, 2026

Claude Code Review

Summary

This PR fixes a critical ARM64 compatibility issue that prevented make kind-up from working on Apple Silicon. The root cause was the kind overlay referencing AMD64-only images from quay.io/gkrumbach07/ instead of multi-arch images from quay.io/ambient_code/. The fix includes proper image registry updates, E2E workflow enhancements, and preventative checks. The approach is sound and aligns with project standards.

Issues by Severity

🚫 Blocker Issues

None identified. The PR is ready to merge.

🔴 Critical Issues

None identified.

🟡 Major Issues

1. Makefile: STATE_SYNC_IMAGE variable undefined

  • Location: Makefile:720, 724, 737
  • Issue: The Makefile references $(STATE_SYNC_IMAGE) but this variable is never defined in the Makefile
  • Impact: The build commands for state-sync will fail unless STATE_SYNC_IMAGE is set externally
  • Fix: Add STATE_SYNC_IMAGE ?= quay.io/ambient_code/vteam_state_sync:latest near the top of the Makefile where other image variables are defined (around line 50-60)

🔵 Minor Issues

1. Documentation: Inconsistent wording about platform support

  • Location: docs/developer/local-development/README.md:5
  • Issue: Table shows Kind as "Linux/macOS" but Kind works on Windows too (via Docker)
  • Recommendation: Change to "All" or add a footnote explaining Windows support via Docker

2. Documentation: "deprecated" vs "older" terminology

  • Location: docs/developer/local-development/README.md
  • Issue: README uses both "deprecated" and "older alternative" for Minikube
  • Recommendation: Pick one term consistently (suggest "deprecated" to be clear)

3. E2E Workflow: Detection logic for state-sync

  • Location: .github/workflows/e2e.yml:162
  • Issue: State-sync image build is triggered by needs.detect-changes.outputs.claude-runner instead of its own detection
  • Recommendation: Add specific detection for state-sync directory changes for better granularity (non-blocking, current logic works)

Positive Highlights

Excellent Root Cause Analysis - The commit message and PR description clearly document the QEMU emulation issue and Go runtime bug

Proper Testing - Validated on actual ARM64 hardware with real AgenticSession creation

Preventative Measures - The check-image-arch target is a great addition that prevents this class of bug from recurring

Documentation Quality - The streamlined README with "bottom line" summaries significantly improves usability

CI/CD Integration - State-sync properly added to E2E workflow with conditional build logic

Consistent Pattern - Image loading follows the same pattern as other components in both E2E and Makefile

Troubleshooting Addition - The new ARM64 troubleshooting section in kind.md is well-written and actionable

Recommendations

Priority 1: Fix STATE_SYNC_IMAGE variable (Major Issue)

Add this near the top of the Makefile (around line 50-60 where other image variables are defined):

STATE_SYNC_IMAGE ?= quay.io/ambient_code/vteam_state_sync:latest

Priority 2: Verify Makefile change works

After adding the variable, test the minikube path:

# This should work after the fix
make _build-and-load

Priority 3: Documentation polish (Optional)

Consider these minor improvements:

  • Standardize "deprecated" terminology for Minikube throughout README.md
  • Clarify Kind's Windows support status in the comparison table

Architectural Alignment

Follows CLAUDE.md standards - Build commands follow established patterns
Matches CI environment - Kind overlay properly uses quay.io/ambient_code/ registry
Multi-platform support - Aligns with CLAUDE.md Container Images standards (ARM64 support via PLATFORM=linux/arm64)
Documentation standards - Follows "improve existing docs" pattern from CLAUDE.md

Security Considerations

No security concerns. The changes:

  • Use public registry images (no credential exposure)
  • Add validation checks (improves security posture)
  • Follow established container image patterns

Performance Impact

Positive: ARM64 users now get native execution instead of QEMU emulation, significantly improving init-hydrate performance.

Final Verdict

APPROVED with minor fix required

The PR successfully resolves the ARM64 compatibility issue with proper root cause analysis, testing, and preventative measures. The only blocking item is defining the STATE_SYNC_IMAGE variable in the Makefile. Once that's added, this is ready to merge.

The documentation improvements are excellent and the check-image-arch safeguard is a great addition that prevents future regressions. Great work on both the fix and the comprehensive validation!


🤖 Review generated by Claude Sonnet 4.5 using repository memory system and coding standards


@github-actions

Correction to Review

I apologize for the error in my initial review. After re-checking the Makefile, I found that STATE_SYNC_IMAGE IS properly defined at line 61:

STATE_SYNC_IMAGE ?= vteam_state_sync:latest

Major Issue #1 is INVALID - Please disregard that section. The Makefile changes are correct and complete.

Updated Final Verdict

APPROVED

All remaining issues are minor, non-blocking recommendations. The PR is ready to merge as-is. The STATE_SYNC_IMAGE variable is properly defined, and all Makefile changes follow established patterns correctly.

Sorry for the confusion!
