kubectl yconverge checks DAG and kustomize base traversal#76
kubectl yconverge checks DAG and kustomize base traversal#76
Conversation
kubectl plugin that wraps kustomize apply with idempotent converge-mode label routing (create, replace, serverside, serverside-force) and post-apply checks defined in yconverge.cue files using a CUE schema. Check types: #Wait (kubectl wait), #Rollout (rollout status), #Exec (arbitrary command with retry-until-timeout). Checks are defined per kustomization in a yconverge.cue file; the framework finds them via 1-level single-directory indirection through kustomization.yaml resources, ignoring sibling file resources. Dependency resolution walks CUE imports to build a topological apply order. Shared check definitions live in pure-CUE packages (no kustomization.yaml) that the dep walker ignores. Modes: apply (default), --diff=true, --checks-only, --print-deps. Apply modifiers: --dry-run=server|none, --skip-checks. Dry-run forwards to both kubectl apply and delete so replace-mode resources are provably non-mutating. Invalid flag combinations fail up front. Namespace for checks resolves from: -n CLI arg > outer kustomization namespace > indirected base namespace > context default. Exported as $NS_GUESS for exec checks alongside $CONTEXT. Error tolerance uses exact criteria: each kubectl step declares the specific error substrings it tolerates (AlreadyExists, no objects passed to apply, No resources found) — anything else surfaces raw. Integration tests run a kwok cluster in Docker with a fake node for pod scheduling. Covers: schema validation, dep resolution, indirection, converge-mode labels, broken-cue rejection, --skip-checks negative, replace-mode dry-run UID preservation, shared checks across db variants (single/distributed), and a PDB safety check demonstrating prod→qa failure detection. CI workflow renamed from "lint" to "checks" to reflect the itest job. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> s
Remove `up` and `namespaceGuess` from verify.#Step. Both were "set by the engine, not by user CUE files" — but the engine never set them either. `up` was designed for a CUE-native orchestrator where CUE's evaluation order needed a data dependency to serialize steps; the shell-based dep walker serializes via a for-loop instead. `namespaceGuess` is handled entirely as the shell variable $NS_GUESS. No yconverge.cue file in the repo references either field. New test: verify dependency checks serialize before downstream steps. Captures the multi-step output of example-with-dependency and asserts line ordering — namespace check completes before configmap step starts, configmap check completes before with-dependency step starts. This is the guarantee `up` was meant to provide, now proven by the shell execution model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Provisioners (qemu, k3d) run kubectl yconverge for gateway-api and gateway before --skip-converge exit. Gateway API is infrastructure assumed present by all functional bases. Remove gateway imports from 29-y-kustomize and 20-gateway DAG. Keep all Traefik checks in 40-kafka-ystack — they verify the complete path kustomize uses for HTTP resources. Use -write instead of --ensure for /etc/hosts to fix stale entries from previous provisioner sessions. E2e: replace y-cluster-provision reprovision with explicit yconverge calls for monitoring and idempotency proof. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gateway step's /etc/hosts update runs before any HTTPRoutes exist. The y-kustomize step creates an HTTPRoute, so /etc/hosts needs updating afterward for kustomize HTTP resource resolution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace API proxy checks (kubectl get --raw .../proxy/...) with curl checks using the exact URL that kustomize HTTP resources reference: http://y-kustomize.ystack.svc.cluster.local/v1/.../base-for-annotations.yaml This is the path kustomize actually uses. If curl succeeds, kustomize will resolve the resource. The API proxy path has different failure modes (endpoint readiness timing) that don't predict kustomize success. 30-blobs-ystack: add blobs content check after restart (was missing). 40-kafka-ystack: kafka base gets 120s timeout (newly mounted secret), blobs base gets 60s (already mounted from previous step). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The y-k8s-ingress-hosts -write command replaces the managed block in /etc/hosts. When called before HTTPRoutes exist (during provisioning), it wrote an empty block — clearing previous entries. This caused curl checks to fail with "Could not resolve host" instead of the assumed secret propagation delay. Fix: skip -write when no ingress/gateway entries are found, preserving existing /etc/hosts entries from earlier steps. With /etc/hosts stable, y-kustomize restart + content availability takes ~4 seconds (secret volume is fresh on new pod). Reduce check timeouts from 120s to 30s. Root cause confirmed: Kubernetes secret volume mounts are instant on new pods. The 60-120s delay from docs applies only to volume UPDATES on running pods (kubelet sync interval). Restarts create new pods with fresh mounts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The new y-kustomize binary watches secrets labeled
yolean.se/module-part=y-kustomize via the Kubernetes API and serves
their content at /v1/{group}/{name}/{key}. Secret changes are
reflected instantly — no pod restart or kubelet volume sync needed.
This eliminates the dual-restart problem where the second restart
lost the first secret's volume mount for 60-120s due to kubelet's
sync interval.
Changes:
- y-kustomize/cmd/: Go binary with secret watch, HTTP server, tests
- y-kustomize/rbac.yaml: ServiceAccount + Role for secret list/watch
- y-kustomize/deployment.yaml: new image, removed volume mounts
- Secret labels: yolean.se/module-part changed from config to y-kustomize
- Init secrets get the label for consistent watch matching
- blobs-ystack/kafka-ystack: remove restart checks, keep content checks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
contain: Go binary from turbokube/contain releases, added to y-bin.runner.yaml with y-contain wrapper. y-kustomize build: contain.yaml: distroless/static:nonroot base, single Go binary layer skaffold.yaml: custom builder using go build + contain, OCI output No Docker required. No push for local dev. y-image-cache-load: add help section, fix lint warnings. Local workflow: cd y-kustomize/cmd go build + contain build → target-oci/ y-image-cache-load to get into cluster CI workflow: Same contain.yaml with --push for ghcr.io Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Init secrets get yolean.se/converge-mode: create label so re-converge doesn't overwrite secrets that have been populated by blobs-ystack or kafka-ystack. The watch-based y-kustomize reacts to secret content changes — empty secrets cause 404. y-cluster-local-ctr: add qemu case using SSH, matching the provisioner's existing SSH connection pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The watch-based y-kustomize reads secrets via the Kubernetes API. It doesn't need empty placeholder secrets to start — it starts with an empty file map and picks up secrets as they're created by blobs-ystack and kafka-ystack. Removes the init step and the dependency from 29-y-kustomize. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds y-kustomize job to images workflow: go build + contain build --push to ghcr.io/yolean/y-kustomize:$SHA Temporarily triggers on y-converge-checks-dag branch pushes. Push will fail on YoleanAgents fork (no ghcr.io/yolean write access) but validates the build succeeds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ghcr.io/yolean/y-kustomize:c55953b69f74067043f2351f8727ea84db1737ca @sha256:e44f99f6bbae59aef485610402c8f3f0125e197fff8616643bd4d5c65ce619e1 Built by GHA images workflow. k3s pulls from ghcr.io on deploy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom builder: go build + contain tarball + ctr import into cluster. Deploy hook restarts y-kustomize after image load. No Docker daemon needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore env -i for acceptance test reproducibility. Registry rollout timeout increased to 120s — first deploy pulls the image from ghcr.io which can exceed 60s on cold cache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The registry timeout was a transient issue, not a real problem. Restore clean env (env -i) for acceptance test reproducibility. e2e passes: 36/36 checks with clean env on fresh cluster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl-yconverge resolves k3s/ paths relative to cwd. Provisioners are called from other repos (checkit) where k3s/ doesn't exist. Use subshell cd to ensure correct path resolution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl writes contexts/clusters/users: null instead of [] when the last item is removed. kubie rejects this as invalid YAML. Fix by replacing null with empty list after context deletion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
y-cluster-converge-ystack accepts --converge=LIST (comma-separated base names without number prefix). Replaces the broken --exclude flag. Default: y-kustomize,blobs,builds-registry. Both provisioners pass --converge and --dry-run through. y-image-list-ystack and y-image-cache-ystack accept the same flag. The provisioner passes its converge targets so all images are pre-cached. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
y-registry-config reads magic ClusterIPs from the source-of-truth YAML files instead of using hostnames. Containerd resolves registries without /etc/hosts hacks on nodes. Qemu provisioner verifies registry access after converge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lint y-cluster-converge-ystack, y-image-list-ystack, and kubectl-yconverge with zero failures required before running integration tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NS_GUESS remains internal. Only NAMESPACE is exported to exec check commands. wait/rollout checks also use NAMESPACE as fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kustomize-traverse walks kustomization directory trees using the kustomize API types. Replaces the bash _find_cue_dir single-dir heuristic with full tree traversal. Checks from all bases are aggregated. Also used for namespace resolution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8d1dd15 to
fe5e0f8
Compare
| push: | ||
| branches: | ||
| - main | ||
| - y-converge-checks-dag |
There was a problem hiding this comment.
remove workflow test
| --converge=LIST comma-separated k3s bases to converge (default: y-kustomize,blobs,builds-registry) | ||
| --skip-converge skip converge and post-provision steps | ||
| --skip-image-load skip image cache and load into containerd | ||
| --dry-run=MODE forward to kubectl-yconverge (server|none) |
There was a problem hiding this comment.
--dry-run doesn't make sense for provision. There's --skip-converge for that.
| # Fix kubectl writing null instead of [] when last item is removed | ||
| sed -i 's/^contexts: null$/contexts: []/' "$KUBECONFIG" 2>/dev/null | ||
| sed -i 's/^clusters: null$/clusters: []/' "$KUBECONFIG" 2>/dev/null | ||
| sed -i 's/^users: null$/users: []/' "$KUBECONFIG" 2>/dev/null |
There was a problem hiding this comment.
We should remove this if only one provisioner does it. It's only kubie that can't handle null so we can consider that an issue with kubie.
| # Gateway API is always set up, even with --skip-converge. | ||
| export OVERRIDE_IP=${YSTACK_PORTS_IP:-127.0.0.1} | ||
| (cd "$YSTACK_HOME" && kubectl-yconverge --context=$CTX -k k3s/10-gateway-api/) | ||
| (cd "$YSTACK_HOME" && kubectl-yconverge --context=$CTX -k k3s/20-gateway/) |
There was a problem hiding this comment.
This way of calling yconverge is poor DX. If the script can run from any CWD we should use $YSTACK_HOME/k3s/20-gateway/ as the base path (or use a root derived from the script invocation). yconverge should not require that it's invoked from ystack root. Also it's a kubectl plugin so kubectl yconverge should work.
| set -e | ||
| YBIN="$(dirname $0)" | ||
|
|
||
| version=$(y-bin-download $YBIN/y-bin.optional.yaml contain) |
There was a problem hiding this comment.
does the declaration remain in optional? if so clean up.
| kubectl kustomize "$d" 2>/dev/null \ | ||
| | grep -oE 'image:\s*\S+' \ | ||
| | sed 's/image:[[:space:]]*//' \ | ||
| || true # y-script-lint:disable=or-true # kustomize may fail for bases requiring y-kustomize HTTP |
There was a problem hiding this comment.
This is not a good enough reason for ignoring errors. If you're talking about transient HTTP errors they're either rare and should propagate (if it's github, provision will likely fail anyway) or frequent in which case they should propagate too so we learn that we should redesign.
| grep '"yolean.se/ystack/' "$1" 2>/dev/null \ | ||
| | grep -v '"yolean.se/ystack/yconverge/verify"' \ | ||
| | sed 's|.*"yolean.se/ystack/\([^":]*\).*|\1|' \ | ||
| || true # y-script-lint:disable=or-true # no imports is valid |
There was a problem hiding this comment.
What other errors will this silently swallow? Do we have test coverage?
|
|
||
| if [ -z "$_YCONVERGE_RESOLVING" ] && [ -n "$KUSTOMIZE_DIR" ]; then | ||
| deps=$(_resolve_deps "$KUSTOMIZE_DIR") | ||
| dep_count=$(printf '%s\n' "$deps" | grep -c . 2>/dev/null) || true # y-script-lint:disable=or-true # grep -c . exit 1 = zero matches |
There was a problem hiding this comment.
find a better way to detect zero matches than to skip errors
| @@ -134,10 +142,7 @@ else | |||
| y-image-cache-load-all </dev/null || true | |||
There was a problem hiding this comment.
Why do we ignore errors here? Why didn't y-script-lint prevent this?
| echo "[y-cluster-provision-qemu] Loading images ..." | ||
| y-image-cache-ystack </dev/null | ||
| y-image-cache-ystack --converge=$CONVERGE_TARGETS </dev/null | ||
| y-image-cache-load-all </dev/null || true |
There was a problem hiding this comment.
- Remove workflow test changes from images.yaml - Remove --dry-run from provisioners (use y-cluster-converge-ystack directly) - Remove kubie null workaround from qemu teardown - Use absolute paths for yconverge calls (no cd to YSTACK_HOME) - y-image-list-ystack: let kustomize errors propagate - kubectl-yconverge: replace grep -c with wc -l, guard file existence in _find_imports, use || : for legitimate empty-string fallbacks - y-cluster-converge-ystack: use absolute paths in _resolve_target Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces #74.
y-cluster-provisionandy-cluster-converge-ystackNew
--converge=LISTflag replaces the broken--excludeflag.Specify which ystack bases to converge as a comma-separated list of
names without number prefix:
Available targets:
y-kustomize,blobs,builds-registry,kafka,buildkit,monitoring,prod-registry.Dependencies are resolved automatically — converging
builds-registrypulls in
blobs,y-kustomize, andkafka-ystack.New
--dry-run=serverpassthrough — verify what would be appliedwithout mutating the cluster:
Registry mirrors
Drops
/etc/hostshacks on nodes. Theregistries.yamlconfignow uses the magic ClusterIPs (
10.43.0.50for builds-registry,10.43.0.51for prod-registry) read from the source-of-truth YAMLfiles. Containerd resolves registries without needing DNS or host
file entries.
The qemu provisioner verifies registry access after converge:
Image caching
y-image-cache-ystackandy-image-list-ystacknow accept the same--converge=LISTflag. The provisioner passes its converge targetsthrough, so all images for the selected bases are pre-cached before
converge starts.
# List images that would be cached y-image-list-ystack --converge=y-kustomize,blobs,builds-registry,kafka,buildkitkubectl-yconverge
NAMESPACEexported to check commands: Exec checks inyconverge.cuecan use$NAMESPACE(the resolved namespace) and$CONTEXTin their commands.NS_GUESSis no longer exported.Multi-dir CUE aggregation via kustomize-traverse: The old 1-level
single-directory heuristic for finding
yconverge.cueis replaced bykustomize-traverse, which walks the full kustomization tree. Checksfrom all bases are collected and run after apply. This fixes the case
where
site-apply-namespaced/references multiple base directories.K3s version
Upgraded from v1.35.1+k3s1 to v1.35.3+k3s1.
New binary: kustomize-traverse
Added as a y-bin managed tool. Walks kustomization directory trees and
reports local directories visited and resolved namespace:
y-kustomize-traverse -o dirs gateway-v4/site-apply-namespaced/ y-kustomize-traverse -o namespace gateway-v4/site-apply-namespaced/Breaking changes
--excludeflag removed from provisioners (was never implementedin
y-cluster-converge-ystack)NS_GUESSno longer exported — use$NAMESPACEin check commandsy-image-list-ystackno longer reads aBASESarray fromy-cluster-converge-ystack— uses--convergeflag instead