
kubectl yconverge checks DAG and kustomize base traversal #76

Open
solsson wants to merge 25 commits into main from y-converge-checks-dag

solsson commented Apr 22, 2026

Replaces #74.

y-cluster-provision and y-cluster-converge-ystack

New --converge=LIST flag replaces the broken --exclude flag.
Specify which ystack bases to converge as a comma-separated list of
names without number prefix:

# Default (minimal): y-kustomize, blobs, builds-registry
y-cluster-provision

# With kafka and buildkit
y-cluster-provision --converge=y-kustomize,blobs,builds-registry,kafka,buildkit

# Direct converge (same syntax)
y-cluster-converge-ystack --context=local --converge=kafka,builds-registry

Available targets: y-kustomize, blobs, builds-registry, kafka,
buildkit, monitoring, prod-registry.
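
A minimal sketch of how a name without its number prefix could be mapped back to a numbered base directory. `resolve_target` and `resolve_targets` are hypothetical helpers (not the PR's `_resolve_target`), and the `NN-<name>` / `NN-<name>-<suffix>` layout is an assumption based on directory names such as `29-y-kustomize` and `30-blobs-ystack` that appear in the commits:

```shell
# Hypothetical helper: map one --converge name to its numbered base directory.
# Assumes bases live under "$1" as NN-<name> or NN-<name>-<suffix>.
resolve_target() {
  rt_root=$1
  rt_name=$2
  for rt_dir in "$rt_root"/[0-9][0-9]-"$rt_name" "$rt_root"/[0-9][0-9]-"$rt_name"-*; do
    # An unmatched glob stays literal and fails the -d test.
    if [ -d "$rt_dir" ]; then
      printf '%s\n' "$rt_dir"
      return 0
    fi
  done
  echo "unknown converge target: $rt_name" >&2
  return 1
}

# Split a comma-separated LIST and resolve each entry in the given order.
resolve_targets() {
  rts_root=$1
  rts_list=$2
  rts_old_ifs=$IFS
  IFS=,
  set -- $rts_list
  IFS=$rts_old_ifs
  for rts_name in "$@"; do
    resolve_target "$rts_root" "$rts_name" || return 1
  done
}

# Usage (prints one directory per line):
#   resolve_targets "$YSTACK_HOME/k3s" y-kustomize,blobs,builds-registry
```

An exact `NN-<name>` match is tried before the suffixed form, and an unknown name fails loudly instead of being skipped.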

Dependencies are resolved automatically — converging builds-registry
pulls in blobs, y-kustomize, and kafka-ystack.

New --dry-run=server passthrough — verify what would be applied
without mutating the cluster:

y-cluster-provision --converge=kafka --dry-run=server

Registry mirrors

Drops /etc/hosts hacks on nodes. The registries.yaml config
now uses the magic ClusterIPs (10.43.0.50 for builds-registry,
10.43.0.51 for prod-registry) read from the source-of-truth YAML
files. Containerd resolves registries without needing DNS or host
file entries.
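
For illustration, a sketch of what generating such a config might look like. The `mirrors`/`endpoint` layout is the standard k3s `registries.yaml` format; the registry hostnames and the plain-HTTP scheme are assumptions, and `render_registries_yaml` is a hypothetical function, not the PR's `y-registry-config`:

```shell
# Sketch: emit a k3s registries.yaml that points registry hostnames at
# fixed ClusterIPs. Hostnames and the http scheme are assumptions.
render_registries_yaml() {
  rr_builds_ip=$1
  rr_prod_ip=$2
  cat <<EOF
mirrors:
  "builds-registry.ystack.svc.cluster.local":
    endpoint:
      - "http://${rr_builds_ip}"
  "prod-registry.ystack.svc.cluster.local":
    endpoint:
      - "http://${rr_prod_ip}"
EOF
}

# Usage:
#   render_registries_yaml 10.43.0.50 10.43.0.51 > registries.yaml
```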

The qemu provisioner verifies registry access after converge:

[y-cluster-provision-qemu] Verifying containerd registry access ...
  builds-registry: OK

Image caching

y-image-cache-ystack and y-image-list-ystack now accept the same
--converge=LIST flag. The provisioner passes its converge targets
through, so all images for the selected bases are pre-cached before
converge starts.

# List images that would be cached
y-image-list-ystack --converge=y-kustomize,blobs,builds-registry,kafka,buildkit

kubectl-yconverge

NAMESPACE exported to check commands: Exec checks in
yconverge.cue can use $NAMESPACE (the resolved namespace) and
$CONTEXT in their commands. NS_GUESS is no longer exported.
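
A minimal sketch of the idea, assuming an exec check is a shell command string. `run_exec_check` is a hypothetical name, not a function from kubectl-yconverge:

```shell
# Sketch: run one exec check with NAMESPACE and CONTEXT set only in the
# environment of the child shell that runs the check command.
run_exec_check() {
  NAMESPACE=$1 CONTEXT=$2 sh -c "$3"
}

# Usage (single quotes so the variables expand inside the check, not here):
#   run_exec_check ystack local \
#     'kubectl --context="$CONTEXT" -n "$NAMESPACE" get deploy'
```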

Multi-dir CUE aggregation via kustomize-traverse: The old 1-level
single-directory heuristic for finding yconverge.cue is replaced by
kustomize-traverse, which walks the full kustomization tree. Checks
from all bases are collected and run after apply. This fixes the case
where site-apply-namespaced/ references multiple base directories.
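
The aggregation step can be sketched as follows, assuming `y-kustomize-traverse -o dirs` prints one visited directory per line (the output format is an assumption). `collect_check_files` is a hypothetical helper:

```shell
# Sketch: read directories (one per line) on stdin and print every
# yconverge.cue found, preserving traversal order.
collect_check_files() {
  while IFS= read -r ccf_dir; do
    ccf_dir=${ccf_dir%/}   # tolerate a trailing slash
    if [ -f "$ccf_dir/yconverge.cue" ]; then
      printf '%s\n' "$ccf_dir/yconverge.cue"
    fi
  done
}

# Usage:
#   y-kustomize-traverse -o dirs gateway-v4/site-apply-namespaced/ \
#     | collect_check_files
```

Directories without a `yconverge.cue` are skipped silently, so bases that define no checks do not need special handling.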

K3s version

Upgraded from v1.35.1+k3s1 to v1.35.3+k3s1.

New binary: kustomize-traverse

Added as a y-bin managed tool. Walks kustomization directory trees and
reports local directories visited and resolved namespace:

y-kustomize-traverse -o dirs gateway-v4/site-apply-namespaced/
y-kustomize-traverse -o namespace gateway-v4/site-apply-namespaced/

Breaking changes

  • --exclude flag removed from provisioners (was never implemented
    in y-cluster-converge-ystack)
  • NS_GUESS no longer exported — use $NAMESPACE in check commands
  • y-image-list-ystack no longer reads a BASES array from
    y-cluster-converge-ystack — uses --converge flag instead

solsson and others added 24 commits April 16, 2026 12:24
kubectl plugin that wraps kustomize apply with idempotent converge-mode
label routing (create, replace, serverside, serverside-force) and
post-apply checks defined in yconverge.cue files using a CUE schema.

Check types: #Wait (kubectl wait), #Rollout (rollout status), #Exec
(arbitrary command with retry-until-timeout). Checks are defined per
kustomization in a yconverge.cue file; the framework finds them via
1-level single-directory indirection through kustomization.yaml
resources, ignoring sibling file resources.
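
A sketch of the retry-until-timeout behaviour described for #Exec, under the assumption that a check is simply rerun until it exits 0 or a deadline passes. `retry_until_timeout` and the 1-second interval are illustrative choices, not taken from the plugin:

```shell
# Sketch: rerun a command until it succeeds or the timeout (seconds) elapses.
retry_until_timeout() {
  rut_timeout=$1
  shift
  rut_deadline=$(( $(date +%s) + rut_timeout ))
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$rut_deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Usage:
#   retry_until_timeout 30 kubectl -n "$NAMESPACE" get configmap example
```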

Dependency resolution walks CUE imports to build a topological apply
order. Shared check definitions live in pure-CUE packages (no
kustomization.yaml) that the dep walker ignores.
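
To illustrate the ordering guarantee only: a topological order can be derived from "dependency dependent" pairs with the standard `tsort` utility. This is a swapped-in technique for illustration; the commit's dep walker is its own shell code, and `apply_order` is a hypothetical name:

```shell
# Sketch: stdin carries one "dependency dependent" pair per line;
# stdout is an order in which every dependency precedes its dependents.
# A node without dependencies can be declared as a self-pair ("a a").
apply_order() {
  tsort
}

# Usage, with step names borrowed from the example-with-dependency test:
#   printf '%s\n' 'namespace configmap' 'configmap with-dependency' | apply_order
# prints namespace, configmap, with-dependency (one per line).
```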

Modes: apply (default), --diff=true, --checks-only, --print-deps.
Apply modifiers: --dry-run=server|none, --skip-checks. Dry-run
forwards to both kubectl apply and delete so replace-mode resources
are provably non-mutating. Invalid flag combinations fail up front.

Namespace for checks resolves from: -n CLI arg > outer kustomization
namespace > indirected base namespace > context default. Exported as
$NS_GUESS for exec checks alongside $CONTEXT.
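
The precedence chain amounts to "first non-empty value wins". A minimal sketch, with `resolve_namespace` as a hypothetical helper and the four inputs assumed to be computed elsewhere:

```shell
# Sketch: print the first non-empty argument; fail if all are empty.
# Call with candidates in precedence order:
#   CLI -n, outer kustomization, indirected base, context default.
resolve_namespace() {
  for rn_candidate in "$@"; do
    if [ -n "$rn_candidate" ]; then
      printf '%s\n' "$rn_candidate"
      return 0
    fi
  done
  return 1
}

# Usage:
#   NS=$(resolve_namespace "$cli_ns" "$outer_ns" "$base_ns" "$context_ns")
```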

Error tolerance uses exact criteria: each kubectl step declares the
specific error substrings it tolerates (AlreadyExists, no objects
passed to apply, No resources found) — anything else surfaces raw.
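
A sketch of exact-substring tolerance, assuming the stderr of a kubectl step has been captured into a string. `is_tolerated` is a hypothetical helper:

```shell
# Sketch: return 0 only if the captured stderr contains one of the
# tolerated substrings passed as the remaining arguments.
is_tolerated() {
  it_stderr=$1
  shift
  for it_needle in "$@"; do
    case $it_stderr in
      *"$it_needle"*) return 0 ;;
    esac
  done
  return 1
}

# Usage: each step lists its own tolerated substrings.
#   is_tolerated "$err" 'AlreadyExists' || { printf '%s\n' "$err" >&2; exit 1; }
```

Matching is literal (no regex), so an unrelated error that merely resembles a tolerated one is still surfaced.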

Integration tests run a kwok cluster in Docker with a fake node for
pod scheduling. Covers: schema validation, dep resolution, indirection,
converge-mode labels, broken-cue rejection, --skip-checks negative,
replace-mode dry-run UID preservation, shared checks across db variants
(single/distributed), and a PDB safety check demonstrating prod→qa
failure detection.

CI workflow renamed from "lint" to "checks" to reflect the itest job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove `up` and `namespaceGuess` from verify.#Step. Both were
"set by the engine, not by user CUE files" — but the engine never
set them either. `up` was designed for a CUE-native orchestrator
where CUE's evaluation order needed a data dependency to serialize
steps; the shell-based dep walker serializes via a for-loop instead.
`namespaceGuess` is handled entirely as the shell variable $NS_GUESS.
No yconverge.cue file in the repo references either field.

New test: verify dependency checks serialize before downstream steps.
Captures the multi-step output of example-with-dependency and asserts
line ordering — namespace check completes before configmap step starts,
configmap check completes before with-dependency step starts. This is
the guarantee `up` was meant to provide, now proven by the shell
execution model.
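
A line-ordering assertion of this kind can be sketched as below. `assert_before` is a hypothetical helper and the log lines in the usage note are invented for illustration; the actual test's output format is not shown in this PR:

```shell
# Sketch: succeed only if the first line of file $1 containing string $2
# comes strictly before the first line containing string $3.
assert_before() {
  awk -v a="$2" -v b="$3" '
    !fa && index($0, a) { fa = NR }
    !fb && index($0, b) { fb = NR }
    END { exit !(fa && fb && fa < fb) }
  ' "$1"
}

# Usage (hypothetical log lines):
#   assert_before out.log 'check namespace: ok' 'apply configmap'
```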

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Provisioners (qemu, k3d) run kubectl yconverge for gateway-api and
gateway before --skip-converge exit. Gateway API is infrastructure
assumed present by all functional bases.

Remove gateway imports from 29-y-kustomize and 20-gateway DAG.
Keep all Traefik checks in 40-kafka-ystack — they verify the
complete path kustomize uses for HTTP resources.

Use -write instead of --ensure for /etc/hosts to fix stale entries
from previous provisioner sessions.

E2e: replace y-cluster-provision reprovision with explicit yconverge
calls for monitoring and idempotency proof.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gateway step's /etc/hosts update runs before any HTTPRoutes exist.
The y-kustomize step creates an HTTPRoute, so /etc/hosts needs updating
afterward for kustomize HTTP resource resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace API proxy checks (kubectl get --raw .../proxy/...) with curl
checks using the exact URL that kustomize HTTP resources reference:
  http://y-kustomize.ystack.svc.cluster.local/v1/.../base-for-annotations.yaml

This is the path kustomize actually uses. If curl succeeds, kustomize
will resolve the resource. The API proxy path has different failure
modes (endpoint readiness timing) that don't predict kustomize success.

30-blobs-ystack: add blobs content check after restart (was missing).
40-kafka-ystack: kafka base gets 120s timeout (newly mounted secret),
  blobs base gets 60s (already mounted from previous step).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The y-k8s-ingress-hosts -write command replaces the managed block in
/etc/hosts. When called before HTTPRoutes exist (during provisioning),
it wrote an empty block — clearing previous entries. This caused curl
checks to fail with "Could not resolve host"; the cause was not the
secret propagation delay previously assumed.

Fix: skip -write when no ingress/gateway entries are found, preserving
existing /etc/hosts entries from earlier steps.
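
A sketch of the skip-when-empty behaviour, operating on any hosts-style file. `write_managed_block` and the marker strings are hypothetical; they are not the ones used by y-k8s-ingress-hosts:

```shell
# Sketch: replace a marker-delimited block in a hosts-style file, but
# leave the file untouched when there are no entries to write.
write_managed_block() {
  wmb_file=$1
  wmb_entries=$2
  if [ -z "$wmb_entries" ]; then
    return 0   # nothing found: keep entries written by earlier steps
  fi
  wmb_tmp=$(mktemp) || return 1
  # Copy everything outside the old managed block.
  awk '
    $0 == "# BEGIN managed" { skip = 1; next }
    $0 == "# END managed"   { skip = 0; next }
    !skip
  ' "$wmb_file" > "$wmb_tmp" || return 1
  {
    echo "# BEGIN managed"
    printf '%s\n' "$wmb_entries"
    echo "# END managed"
  } >> "$wmb_tmp"
  # Overwrite in place so the file keeps its owner and mode.
  cat "$wmb_tmp" > "$wmb_file" && rm -f "$wmb_tmp"
}
```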

With /etc/hosts stable, y-kustomize restart + content availability
takes ~4 seconds (secret volume is fresh on new pod). Reduce check
timeouts from 120s to 30s.

Root cause confirmed: Kubernetes secret volume mounts are instant on
new pods. The 60-120s delay from docs applies only to volume UPDATES
on running pods (kubelet sync interval). Restarts create new pods
with fresh mounts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The new y-kustomize binary watches secrets labeled
yolean.se/module-part=y-kustomize via the Kubernetes API and serves
their content at /v1/{group}/{name}/{key}. Secret changes are
reflected instantly — no pod restart or kubelet volume sync needed.

This eliminates the dual-restart problem where the second restart
lost the first secret's volume mount for 60-120s due to kubelet's
sync interval.

Changes:
- y-kustomize/cmd/: Go binary with secret watch, HTTP server, tests
- y-kustomize/rbac.yaml: ServiceAccount + Role for secret list/watch
- y-kustomize/deployment.yaml: new image, removed volume mounts
- Secret labels: yolean.se/module-part changed from config to y-kustomize
- Init secrets get the label for consistent watch matching
- blobs-ystack/kafka-ystack: remove restart checks, keep content checks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
contain: Go binary from turbokube/contain releases, added to
y-bin.runner.yaml with y-contain wrapper.

y-kustomize build:
  contain.yaml: distroless/static:nonroot base, single Go binary layer
  skaffold.yaml: custom builder using go build + contain, OCI output
  No Docker required. No push for local dev.

y-image-cache-load: add help section, fix lint warnings.

Local workflow:
  cd y-kustomize/cmd
  go build + contain build → target-oci/
  y-image-cache-load to get into cluster

CI workflow:
  Same contain.yaml with --push for ghcr.io

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Init secrets get yolean.se/converge-mode: create label so re-converge
doesn't overwrite secrets that have been populated by blobs-ystack
or kafka-ystack. The watch-based y-kustomize reacts to secret content
changes — empty secrets cause 404.

y-cluster-local-ctr: add qemu case using SSH, matching the provisioner's
existing SSH connection pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The watch-based y-kustomize reads secrets via the Kubernetes API.
It doesn't need empty placeholder secrets to start — it starts with
an empty file map and picks up secrets as they're created by
blobs-ystack and kafka-ystack.

Removes the init step and the dependency from 29-y-kustomize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds y-kustomize job to images workflow:
  go build + contain build --push to ghcr.io/yolean/y-kustomize:$SHA

Temporarily triggers on y-converge-checks-dag branch pushes.
Push will fail on YoleanAgents fork (no ghcr.io/yolean write access)
but validates the build succeeds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ghcr.io/yolean/y-kustomize:c55953b69f74067043f2351f8727ea84db1737ca
@sha256:e44f99f6bbae59aef485610402c8f3f0125e197fff8616643bd4d5c65ce619e1

Built by GHA images workflow. k3s pulls from ghcr.io on deploy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom builder: go build + contain tarball + ctr import into cluster.
Deploy hook restarts y-kustomize after image load.
No Docker daemon needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore env -i for acceptance test reproducibility.

Registry rollout timeout increased to 120s — first deploy pulls
the image from ghcr.io which can exceed 60s on cold cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The registry timeout was a transient issue, not a real problem.
Restore clean env (env -i) for acceptance test reproducibility.

e2e passes: 36/36 checks with clean env on fresh cluster.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl-yconverge resolves k3s/ paths relative to cwd. Provisioners
are called from other repos (checkit) where k3s/ doesn't exist.
Use subshell cd to ensure correct path resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl writes contexts/clusters/users: null instead of [] when the
last item is removed. kubie rejects this as invalid YAML. Fix by
replacing null with empty list after context deletion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
y-cluster-converge-ystack accepts --converge=LIST (comma-separated
base names without number prefix). Replaces the broken --exclude flag.
Default: y-kustomize,blobs,builds-registry. Both provisioners pass
--converge and --dry-run through.

y-image-list-ystack and y-image-cache-ystack accept the same flag.
The provisioner passes its converge targets so all images are pre-cached.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
y-registry-config reads magic ClusterIPs from the source-of-truth
YAML files instead of using hostnames. Containerd resolves registries
without /etc/hosts hacks on nodes. Qemu provisioner verifies registry
access after converge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lint y-cluster-converge-ystack, y-image-list-ystack, and
kubectl-yconverge with zero failures required before running
integration tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NS_GUESS remains internal. Only NAMESPACE is exported to exec check
commands. wait/rollout checks also use NAMESPACE as fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kustomize-traverse walks kustomization directory trees using the
kustomize API types. Replaces the bash _find_cue_dir single-dir
heuristic with full tree traversal. Checks from all bases are
aggregated. Also used for namespace resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
solsson changed the title from "y-converge checks DAG and kustomize base traversal" to "kubectl yconverge checks DAG and kustomize base traversal" Apr 22, 2026
solsson force-pushed the y-converge-checks-dag branch from 8d1dd15 to fe5e0f8 April 23, 2026 04:48
Comment thread .github/workflows/images.yaml Outdated
push:
branches:
- main
- y-converge-checks-dag

remove workflow test

Comment thread bin/y-cluster-provision-qemu Outdated
--converge=LIST comma-separated k3s bases to converge (default: y-kustomize,blobs,builds-registry)
--skip-converge skip converge and post-provision steps
--skip-image-load skip image cache and load into containerd
--dry-run=MODE forward to kubectl-yconverge (server|none)

--dry-run doesn't make sense for provision. There's --skip-converge for that.

Comment thread bin/y-cluster-provision-qemu Outdated
# Fix kubectl writing null instead of [] when last item is removed
sed -i 's/^contexts: null$/contexts: []/' "$KUBECONFIG" 2>/dev/null
sed -i 's/^clusters: null$/clusters: []/' "$KUBECONFIG" 2>/dev/null
sed -i 's/^users: null$/users: []/' "$KUBECONFIG" 2>/dev/null

We should remove this if only one provisioner does it. It's only kubie that can't handle null so we can consider that an issue with kubie.

Comment thread bin/y-cluster-provision-k3d Outdated
# Gateway API is always set up, even with --skip-converge.
export OVERRIDE_IP=${YSTACK_PORTS_IP:-127.0.0.1}
(cd "$YSTACK_HOME" && kubectl-yconverge --context=$CTX -k k3s/10-gateway-api/)
(cd "$YSTACK_HOME" && kubectl-yconverge --context=$CTX -k k3s/20-gateway/)

This way of calling yconverge is poor DX. If the script can run from any CWD we should use $YSTACK_HOME/k3s/20-gateway/ as the base path (or use a root derived from the script invocation). yconverge should not require that it's invoked from ystack root. Also it's a kubectl plugin so kubectl yconverge should work.
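
One way to derive the root from the script invocation, sketched under the assumption that provisioner scripts live in `<root>/bin/`. `root_from_script` is a hypothetical helper:

```shell
# Sketch: resolve the repository root from a script path, independent of
# the caller's working directory. The cd happens in a subshell only.
root_from_script() {
  (cd "$(dirname "$1")/.." && pwd -P)
}

# Usage inside a script in <root>/bin/:
#   YSTACK_HOME=$(root_from_script "$0")
#   kubectl yconverge --context="$CTX" -k "$YSTACK_HOME/k3s/20-gateway/"
```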

Comment thread bin/y-contain
set -e
YBIN="$(dirname $0)"

version=$(y-bin-download $YBIN/y-bin.optional.yaml contain)

does the declaration remain in optional? if so clean up.

Comment thread bin/y-image-list-ystack Outdated
kubectl kustomize "$d" 2>/dev/null \
| grep -oE 'image:\s*\S+' \
| sed 's/image:[[:space:]]*//' \
|| true # y-script-lint:disable=or-true # kustomize may fail for bases requiring y-kustomize HTTP

This is not a good enough reason for ignoring errors. If you're talking about transient HTTP errors they're either rare and should propagate (if it's github, provision will likely fail anyway) or frequent in which case they should propagate too so we learn that we should redesign.
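
A sketch of one way to let build errors propagate while still extracting image references: render first, and only parse if rendering succeeded. `extract_images` and `list_images` are hypothetical names; `kubectl kustomize` is the same command the script already uses:

```shell
# Sketch: print the value of every "image:" field found on stdin.
extract_images() {
  sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*image:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Render a base, returning kustomize's exit status on failure instead of
# hiding it behind a pipeline or "|| true".
list_images() {
  li_rendered=$(kubectl kustomize "$1") || return $?
  printf '%s\n' "$li_rendered" | extract_images
}
```

Because `sed -n ... p` exits 0 even when nothing matches, a base without images is not mistaken for a failure.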

Comment thread bin/kubectl-yconverge Outdated
grep '"yolean.se/ystack/' "$1" 2>/dev/null \
| grep -v '"yolean.se/ystack/yconverge/verify"' \
| sed 's|.*"yolean.se/ystack/\([^":]*\).*|\1|' \
|| true # y-script-lint:disable=or-true # no imports is valid

What other errors will this silently swallow? Do we have test coverage?
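
A sketch of an import scan that treats "no imports" as valid but surfaces everything else. `find_imports` is a hypothetical stand-in for `_find_imports`; the sed expression is the one from the diff above, and the import path shape used in the usage note is an assumption:

```shell
# Sketch: print ystack import paths from a CUE file, excluding the
# shared verify package. Only grep's "no match" status is tolerated.
find_imports() {
  fi_file=$1
  if [ ! -f "$fi_file" ]; then
    echo "find_imports: no such file: $fi_file" >&2
    return 2
  fi
  if fi_matches=$(grep '"yolean.se/ystack/' "$fi_file"); then
    printf '%s\n' "$fi_matches" | sed \
      -e '\|"yolean.se/ystack/yconverge/verify"|d' \
      -e 's|.*"yolean.se/ystack/\([^":]*\).*|\1|'
  else
    fi_status=$?
    # grep exits 1 when nothing matched (valid), 2 or more on real errors.
    [ "$fi_status" -eq 1 ] && return 0
    return "$fi_status"
  fi
}

# Usage:
#   find_imports k3s/40-kafka-ystack/yconverge.cue
```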

Comment thread bin/kubectl-yconverge Outdated

if [ -z "$_YCONVERGE_RESOLVING" ] && [ -n "$KUSTOMIZE_DIR" ]; then
deps=$(_resolve_deps "$KUSTOMIZE_DIR")
dep_count=$(printf '%s\n' "$deps" | grep -c . 2>/dev/null) || true # y-script-lint:disable=or-true # grep -c . exit 1 = zero matches

find a better way to detect zero matches than to skip errors
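
One option, sketched with awk, which exits 0 whether or not anything matched. `count_nonempty_lines` is a hypothetical helper:

```shell
# Sketch: count non-blank lines in a string without relying on an exit
# status that conflates "zero matches" with failure.
count_nonempty_lines() {
  printf '%s\n' "$1" | awk 'NF { n++ } END { print n + 0 }'
}

# Usage:
#   dep_count=$(count_nonempty_lines "$deps")
```

A plain `wc -l` on `printf '%s\n' "$deps"` would report 1 for an empty string, so blank lines are filtered explicitly here.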

@@ -134,10 +142,7 @@ else
y-image-cache-load-all </dev/null || true

Why do we ignore errors here? Why didn't y-script-lint prevent this?

 echo "[y-cluster-provision-qemu] Loading images ..."
-y-image-cache-ystack </dev/null
+y-image-cache-ystack --converge=$CONVERGE_TARGETS </dev/null
 y-image-cache-load-all </dev/null || true

- Remove workflow test changes from images.yaml
- Remove --dry-run from provisioners (use y-cluster-converge-ystack directly)
- Remove kubie null workaround from qemu teardown
- Use absolute paths for yconverge calls (no cd to YSTACK_HOME)
- y-image-list-ystack: let kustomize errors propagate
- kubectl-yconverge: replace grep -c with wc -l, guard file existence
  in _find_imports, use || : for legitimate empty-string fallbacks
- y-cluster-converge-ystack: use absolute paths in _resolve_target

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>