-
Notifications
You must be signed in to change notification settings - Fork 49
fix: state-sync arm64 compatibility across CI and local dev #618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: state-sync arm64 compatibility across CI and local dev #618
Conversation
This comment has been minimized.
This comment has been minimized.
b8c16f9 to
a1b81fe
Compare
`make kind-up` on Apple Silicon crashed because the kind overlay pointed to amd64-only images at quay.io/gkrumbach07/. Fixed by switching to the multi-arch images at quay.io/ambient_code/ and adding a pre-flight architecture check so this class of bug is caught before deployment. Root cause: The state-sync init container (rclone, a Go binary) hit a known Go runtime bug (lfstack.push invalid packing) when an amd64 binary ran under QEMU emulation on arm64. The CI build pipeline already produces multi-arch images at quay.io/ambient_code/ — the kind overlay just wasn't using them. Changes: - Fix kind overlay to use quay.io/ambient_code/ registry (primary fix) - Add state-sync to E2E workflow (build/pull + kind load) - Add state-sync to Makefile _build-and-load target (minikube path) - Add check-image-arch Make target as kind-up prerequisite — on arm64 hosts, fails fast if overlay references non-multi-arch images - Streamline local dev docs: add bottom-line summary, remove duplicate Kind section, add arm64 troubleshooting entry Validated: created AgenticSession on arm64 kind cluster, init-hydrate completed successfully (exit 0, rclone connected to S3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
a1b81fe to
e3dc83e
Compare
Claude Code ReviewSummaryThis PR fixes a critical compatibility issue for Apple Silicon developers by updating Kind overlay images from amd64-only to multi-arch images. The changes include CI improvements, preventative checks, and documentation streamlining. The approach is methodical and well-executed. Issues by Severity🚫 Blocker IssuesNone - PR is ready to merge. 🔴 Critical IssuesNone identified. 🟡 Major IssuesNone identified. 🔵 Minor Issues1. E2E Workflow - State-sync build condition could be more accurate # Line 162-163
if [ "${{ needs.detect-changes.outputs.claude-runner }}" == "true" ]; then
echo "Building state-sync (changed)..."Issue: State-sync is built when the Recommendation: If Impact: Low - The fallback to pulling 2. Documentation - Missing troubleshooting step verification # Should show: quay.io/ambient_code/vteam_state_sync:latest
# If it shows gkrumbach07 or another registry, update the kind overlay and redeployIssue: The troubleshooting guide tells users to "update the kind overlay and redeploy" but doesn't provide the exact commands. Recommendation: Add the specific fix commands: # Fix: Edit components/manifests/overlays/kind/operator-env-patch.yaml
# Change STATE_SYNC_IMAGE to quay.io/ambient_code/vteam_state_sync:latest
# Then redeploy:
kubectl delete deployment agentic-operator -n ambient-code
make kind-deployImpact: Low - Developers can figure this out, but explicit commands improve UX. 3. Makefile - BAD=$(grep -E 'value:.*quay\.io/' "$$OVERLAY" 2>/dev/null | grep -v 'quay\.io/ambient_code/' || true);Issue: The check only validates the kind overlay, but similar issues could occur in minikube or other overlays. Recommendation: Consider extending the check to all overlays or document that this is kind-specific: # Option 1: Check all overlays
OVERLAYS=$(find components/manifests/overlays -name "*-env-patch.yaml");
for OVERLAY in $$OVERLAYS; do ... done
# Option 2: Add comment explaining scope
check-image-arch: ## Verify kind overlay images are multi-arch compatible with hostImpact: Low - Kind is the recommended approach, so this is sufficient for now. Positive Highlights✅ Excellent preventative engineering - The ✅ CI/E2E coverage improvement - Adding state-sync to the E2E workflow closes a gap that could have caused mysterious CI failures. Good defensive practice. ✅ Documentation streamlining - Reducing README from 249 to 65 lines dramatically improves discoverability. The "bottom line" summary at the top is excellent UX. This follows the CLAUDE.md principle: "Default to improving existing documentation rather than creating new files." ✅ Architecture-aware fix - The root cause analysis correctly identifies the QEMU emulation issue with amd64-only Go binaries on arm64. The fix (switch to multi-arch images) is the right architectural solution rather than a workaround. ✅ Testing rigor - The validation section shows real testing on Apple Silicon hardware with actual AgenticSession creation and init container verification. This demonstrates thorough validation. ✅ Follows project patterns - The Makefile changes follow established conventions (COLOR_* variables, check-* target naming, target prerequisites). The grep pattern for registry detection is appropriate for the use case. RecommendationsPriority 1 (Pre-merge):
Priority 2 (Follow-up, optional):
Priority 3 (Future enhancement):
Overall Assessment: ✅ Approve and Merge This PR demonstrates excellent engineering practices:
The minor issues are truly minor and don't block merging. The core fix is sound, well-tested, and follows project standards. 🤖 Reviewed by Claude Sonnet 4.5 via Claude Code 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Claude Code ReviewSummaryThis PR fixes a critical ARM64 compatibility issue that prevented Issues by Severity🚫 Blocker IssuesNone identified. The PR is ready to merge. 🔴 Critical IssuesNone identified. 🟡 Major Issues1. Makefile: STATE_SYNC_IMAGE variable undefined
🔵 Minor Issues1. Documentation: Inconsistent wording about platform support
2. Documentation: "deprecated" vs "older" terminology
3. E2E Workflow: Detection logic for state-sync
Positive Highlights✅ Excellent Root Cause Analysis - The commit message and PR description clearly document the QEMU emulation issue and Go runtime bug ✅ Proper Testing - Validated on actual ARM64 hardware with real AgenticSession creation ✅ Preventative Measures - The ✅ Documentation Quality - The streamlined README with "bottom line" summaries significantly improves usability ✅ CI/CD Integration - State-sync properly added to E2E workflow with conditional build logic ✅ Consistent Pattern - Image loading follows the same pattern as other components in both E2E and Makefile ✅ Troubleshooting Addition - The new ARM64 troubleshooting section in kind.md is well-written and actionable RecommendationsPriority 1: Fix STATE_SYNC_IMAGE variable (Major Issue)Add this near the top of the Makefile (around line 50-60 where other image variables are defined): STATE_SYNC_IMAGE ?= quay.io/ambient_code/vteam_state_sync:latestPriority 2: Verify Makefile change worksAfter adding the variable, test the minikube path: # This should work after the fix
make _build-and-loadPriority 3: Documentation polish (Optional)Consider these minor improvements:
Architectural Alignment✅ Follows CLAUDE.md standards - Build commands follow established patterns Security ConsiderationsNo security concerns. The changes:
Performance ImpactPositive: ARM64 users now get native execution instead of QEMU emulation, significantly improving init-hydrate performance. Final VerdictAPPROVED with minor fix required ✅ The PR successfully resolves the ARM64 compatibility issue with proper root cause analysis, testing, and preventative measures. The only blocking item is defining the The documentation improvements are excellent and the 🤖 Review generated by Claude Sonnet 4.5 using repository memory system and coding standards 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Correction to ReviewI apologize for the error in my initial review. After re-checking the Makefile, I found that STATE_SYNC_IMAGE IS properly defined at line 61: STATE_SYNC_IMAGE ?= vteam_state_sync:latestMajor Issue #1 is INVALID - Please disregard that section. The Makefile changes are correct and complete. Updated Final VerdictAPPROVED ✅ All issues identified were minor documentation recommendations only. The PR is ready to merge as-is. The STATE_SYNC_IMAGE variable is properly defined, and all Makefile changes follow established patterns correctly. Sorry for the confusion! |
Summary
make kind-upon Apple Silicon crashed because the kind overlay pointed to amd64-only images. Fixed by switching to multi-arch images and adding a pre-flight architecture check.overlays/kind/operator-env-patch.yaml) pointed toquay.io/gkrumbach07/(amd64-only) instead ofquay.io/ambient_code/(multi-arch). The state-sync init container's rclone binary (Go) hit a known Go runtime bug (lfstack.push invalid packing) under QEMU emulation on arm64._build-and-loadtarget (minikube path) omitted state-sync.check-image-archMake target runs as akind-upprerequisite. On arm64 hosts, fails fast if the overlay references images outsidequay.io/ambient_code/.Validation
Tested on Apple Silicon (arm64) Mac with existing kind cluster:
quay.io/ambient_code/vteam_state_sync:latesttestnamespaceinit-hydratecompleted with exit code 0 — rclone connected to S3, repo cloned successfullymake check-image-archpasses with fixed overlay and fails with oldgkrumbach07registryTest plan
make kind-upon Apple Silicon Mac — all pods start, state-sync pulls fromquay.io/ambient_code/init-hydratecompletes without crashmake check-image-archpasses on arm64make check-image-archblocks with clear error🤖 Generated with Claude Code