Formalize checks for kubectl yconverge #74

Open
solsson wants to merge 37 commits into main from converge-dag

Collaborator

solsson commented Apr 16, 2026

No description provided.

Yolean k8s-qa and others added 30 commits April 2, 2026 10:19
Restores CUE CLI that was in ystack from 2022-2023 (removed during
y-bin.yaml split). Available as y-cue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add --keep-disk flag to preserve the disk image for faster re-provision.
Without it, teardown now removes the qcow2 disk for clean e2e runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generic, POSIX-compatible convergence plugin supporting five modes
via the yolean.se/converge-mode label:
  (none)             standard kubectl apply
  create             kubectl create --save-config (skip if exists)
  replace            kubectl delete + apply (for immutable resources)
  serverside         kubectl apply --server-side
  serverside-force   kubectl apply --server-side --force-conflicts

Flags: --context= (required), --diff=true, --dry-run=true.
Honors KUBECONFIG. Handles empty label selections gracefully.

Moved from checkit where it had a hardcoded --server-side path.
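
As a rough illustration of the mode table above, the label value could map to kubectl invocations along these lines. This is a minimal sketch, not the plugin's code: the function name `mode_to_kubectl_args` is hypothetical, and the plugin is described as selecting resources by label rather than taking a mode argument.

```shell
#!/bin/sh
# Hypothetical helper: print the kubectl action for a converge-mode value.
# Mirrors the table in the commit message; not taken from kubectl-yconverge.
mode_to_kubectl_args() {
  case "$1" in
    "")               echo "apply" ;;
    create)           echo "create --save-config" ;;
    replace)          echo "delete, then apply" ;;
    serverside)       echo "apply --server-side" ;;
    serverside-force) echo "apply --server-side --force-conflicts" ;;
    *)                echo "unknown converge-mode: $1" >&2; return 1 ;;
  esac
}
```

Rejecting unknown values keeps a mistyped label from silently falling back to a plain apply.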

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the approach for replacing implicit script ordering with
CUE import-based dependency resolution and kubectl yconverge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Schema (cue/converge/schema.cue):
  #Step, #Check (#Wait, #Rollout, #Exec), #Action

Module declarations (k3s/*/y-k8s.cue):
  17 modules with typed dependencies, checks, and actions.
  Dependencies expressed as CUE imports — the import graph IS the DAG.

Engine (converge_tool.cue):
  `y-cue cmd converge -t context=local -t path=$PATH`
  Prints human-readable converge plan before execution.
  Delegates all applies to kubectl-yconverge.
  Translates checks to kubectl wait/rollout/exec commands.

Also: add serverside-force label to k3s/10-gateway-api/kustomization.yaml
so kubectl-yconverge handles CRD apply correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bash convergence script now delegates to:
  y-cue cmd converge ./cue/converge-ystack/

Sequential task chaining ensures steps execute in DAG order.
Override-ip, KUBECONFIG, and PATH are passed as CUE tags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two proposed design additions:
- yconverge.cue: auto-invoke checks from kubectl-yconverge on exit 0
- y-kustomize refresh: hash-based restart tracking via annotation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename all y-k8s.cue to yconverge.cue (must be next to kustomization.yaml).

kubectl-yconverge now auto-invokes checks from yconverge.cue after a
successful apply. Supports one level of `resources:` indirection for
finding yconverge.cue in referenced directories.

Add --skip-checks flag for batch operations (used by CUE engine which
runs its own checks). Checks are also skipped for --dry-run and --diff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Integration tests (cue/itest/) using kwok as lightweight test cluster:
  17 tests covering schema validation, auto-checks, dependency preconditions,
  transitive deps, disabled steps, resources indirection, idempotency,
  error reporting, and --skip-checks.

kubectl-yconverge now:
  - Finds yconverge.cue next to kustomization.yaml (or one level indirection)
  - Runs step.prechecks BEFORE apply (dependency checks)
  - Runs step.checks AFTER apply (this step's verification)
  - Skips entirely when enabled: false

Schema adds prechecks field to #Step for explicit dependency check export.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each -k runs the full cycle (prechecks -> apply -> checks) sequentially.
Other args like -n are passed to every apply. This enables:

  kubectl-yconverge --context=local \
    -k cluster-local/00-cluster/ \
    -k cluster-local/mysql/ \
    -k keycloak-v3/topic-events/

Reduces repetition in provisioning scripts.
Adds itest coverage for multi -k (19/19 tests pass).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
runner.Dockerfile: add y-cue binary download and version check.

lint.yaml: add itest job that runs cue/itest/test.sh with kwok
as a lightweight test cluster. Runs on push to main and PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests serverside-force label where create/delete/regular selectors
match nothing. Verifies _apply_if_any handles empty results gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The enabled field belongs to the DAG engine's step filtering, not to
the apply tool. When someone runs kubectl yconverge -k dir/, they're
explicitly asking to apply — refusing is confusing.

enabled:false remains in the CUE schema for the DAG engine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensure y-cue, y-yq, kubectl are available before running tests.
Fixes CUE vet failures in GHA where binaries weren't pre-downloaded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
y-bin-download needs YSTACK_HOME to find the bin directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Schema changes:
- Remove kustomization (implicit from yconverge.cue placement)
- Remove enabled, actions, prechecks from #Step
- Keep checks and up (core DAG field)
- Keep namespace on #Wait/#Rollout (namespace handling TBD)

Remove prechecks from kubectl-yconverge. Move y-kustomize restart
from action to exec check in 30-blobs-ystack and 40-kafka-ystack.

Namespace on modeled checks is kept for now. Proper namespace
propagation behavior needs specification and test cases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
namespaceGuess is set by the engine (like up), not by user CUE files.
Priority: -n CLI arg > kustomization.yaml namespace: > context default.

Used as fallback for #Wait/#Rollout checks that omit namespace.
Namespace test cases TBD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When indirection is in effect (yconverge.cue found in a referenced
resource directory), the referenced base's kustomization.yaml
namespace: field is checked before the current directory's.

Priority order is now:
  1. -n CLI arg
  2. referenced base namespace (indirection)
  3. kustomization.yaml namespace:
  4. context default namespace
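
The priority order above amounts to "first non-empty candidate wins". A minimal sketch, assuming the four values have already been looked up by the caller; the function name `resolve_namespace_guess` and the final fallback to `default` are assumptions, not taken from the plugin.

```shell
#!/bin/sh
# Hypothetical helper: return the first non-empty candidate, in priority order.
# Arguments (any may be an empty string):
#   $1  -n CLI arg
#   $2  namespace: of the referenced base's kustomization.yaml (indirection)
#   $3  namespace: of the current directory's kustomization.yaml
#   $4  default namespace of the kubectl context
resolve_namespace_guess() {
  for candidate in "$1" "$2" "$3" "$4"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  # Assumption: with nothing set anywhere, use the "default" namespace
  printf 'default\n'
}
```

For example, `resolve_namespace_guess "" "itest" "other" ""` prints `itest`, because the referenced base ranks above the current directory.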

Updated indirection itest to use namespaced resource (configmap in
itest namespace) instead of cluster-scoped (namespace). Verifies
namespace resolution from referenced base.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If a yconverge.cue file is found, y-cue eval must succeed.
A broken file now produces a clear error instead of silently
falling back to no checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each test is echo + kubectl-yconverge. Negative tests negate with !.
No pass/fail counters — set -eo pipefail stops on first failure.

Adds negative test for broken yconverge.cue (must fail, not skip).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assert on stdout/stderr using tee to tmp file + grep:
- Multi -k shows exactly 3 yconverge.cue discoveries
- Indirection output references the base directory path
- --skip-checks produces no [yconverge] output
- Broken yconverge.cue produces ERROR message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Simplifies arg parsing and control flow. Each invocation handles
one -k directory. Callers use separate calls for multiple steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
60-builds-registry: kustomization has namespace: ystack, so rollout
check uses namespaceGuess instead of explicit namespace.

11-monitoring-operator: targets default namespace which is what
namespaceGuess resolves to.

Fix: kubectl config view for namespace resolution must not fail
under set -e when context has no default namespace.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl-yconverge now resolves the dependency tree from yconverge.cue
imports on first invocation. Each dependency is converged by calling
kubectl-yconverge recursively (with _YCONVERGE_RESOLVING=1 to prevent
infinite recursion).

New: bin/y-yconverge-deps — extracts topological order from CUE imports.

Removed: cue/converge-ystack/ — the CUE runtime engine is replaced by
dependency resolution in kubectl-yconverge itself.

y-cluster-converge-ystack simplified to four leaf-target calls.
Each resolves its own dependency chain. Shared deps are idempotent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cue/converge/ -> yconverge/converge/
cue/itest/ -> yconverge/itest/

All imports updated from yolean.se/ystack/cue/converge
to yolean.se/ystack/yconverge/converge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yconverge/converge/ -> yconverge/verify/
package converge -> package verify

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
--keep uses /tmp/ystack-yconverge-itest (stable, reusable).
Without --keep uses a temp file deleted on cleanup.
Prints KUBECONFIG path with --keep for manual inspection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set current-context, create a named user entry, and reference it
from the context. Fixes kubie which requires non-null users list
and a current-context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Yolean k8s-qa and others added 7 commits April 5, 2026 14:28
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yconverge: verify exact criteria before swallowing errors

_apply_if_any tolerated any *"not found"* substring, which silently
masked real apply failures like `namespaces "X" not found`. Replaced
with _kubectl_tolerant(allowed_patterns, kubectl args) that tolerates
only an explicit |-separated list. Each call site names the exact
expected errors:
- create --save-config: AlreadyExists, no objects passed to create
- apply variants: no objects passed to apply
- delete --selector: strict (kubectl exits 0 on empty match)
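
The allow-list idea can be sketched as follows. This is an illustration under assumptions, not the actual `_kubectl_tolerant`: the name `run_tolerant` is hypothetical, it wraps any command rather than only kubectl, and it assumes the patterns contain no shell glob characters.

```shell
#!/bin/sh
# Hypothetical sketch of an explicit allow-list wrapper.
# $1:   |-separated substrings that make a failure acceptable
# rest: the command to run (kubectl ... in the real tool)
run_tolerant() {
  allowed=$1
  shift
  status=0
  output=$("$@" 2>&1) || status=$?
  if [ "$status" -eq 0 ]; then
    if [ -n "$output" ]; then printf '%s\n' "$output"; fi
    return 0
  fi
  # Split the allow-list on | and test each entry as a literal substring
  old_ifs=$IFS
  IFS='|'
  for pattern in $allowed; do
    [ -n "$pattern" ] || continue   # an empty entry would match everything
    case "$output" in
      *"$pattern"*) IFS=$old_ifs; return 0 ;;
    esac
  done
  IFS=$old_ifs
  # Not an expected error: surface the output and the original exit code
  printf '%s\n' "$output" >&2
  return "$status"
}
```

With this shape, a call site allowing only `no objects passed to apply` still fails on `namespaces "X" not found`, which is the case the old substring match masked.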

Also stripped blanket 2>/dev/null masks from y-yq calls, kubectl
config view, and y-yconverge-deps invocation. Restructured _find_cue
and _find_yconverge_dir so "no cue file" is empty output + exit 0
while real parse errors propagate via set -e.

itest: cover cluster-prod/db, create namespace in test driver

Adds a cluster-prod/db step to the itest, creating the db namespace
up front since bases intentionally don't carry a Namespace (keeps
delete -k reversible). Also tightens docker rm cleanup to verify
"No such container" before swallowing.

db-service: add ports and clusterIP so kustomize renders a valid
headless Service.

yconverge: label internal kubectl output so users can attribute it

Every internal step (create, delete replace-mode, 3 apply variants) now
runs through _kubectl_step which prefixes each output line with
"  [yconverge] <step>: " — matching the existing "[yconverge] found"
and "[yconverge] check:" convention. Expected "nothing to do" stdout
(e.g. delete's "No resources found") is silenced, so a clean re-apply
only shows lines the user cares about.

Renames _kubectl_tolerant → _kubectl_step and adds a third arg for
|-separated success-stdout substrings to suppress.

itest: --skip-checks negative test now greps for "[yconverge] check:"
and "[yconverge] found" specifically, since plain apply output also
carries the "[yconverge]" prefix now.

yconverge: pass meaningful kubectl output through raw

Drops the "  [yconverge] <step>: " prefix on internal kubectl step
output — users already recognize kubectl's own format. Expected
"nothing to do" outputs (delete's "No resources found") are still
silenced via the _empty_ok arg, so a clean re-apply shows only the
raw lines the user cares about. The label arg is dropped.

Reverts the --skip-checks negative test back to grepping for any
"[yconverge]" since kubectl step output no longer carries the prefix.
…dry-run

Flag redesign — modes are mutually exclusive with up-front validation:
  default (apply), --diff=true, --checks-only, --print-deps.
Modifiers --dry-run=server|none and --skip-checks are gated to apply
mode. --dry-run=client is rejected (kubectl's own --server-side
limitation); --dry-run=true no longer silently maps to server. Help
prints without --context for kubectl yconverge, help, --help, -h,
or no args.
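
Up-front validation of this kind might look roughly like the sketch below. The function name `parse_mode`, the exit code 2, and the error messages are assumptions; it also assumes that every `--dry-run` value other than `server` or `none` is rejected, and that an explicit `--dry-run=none` counts as a modifier.

```shell
#!/bin/sh
# Hypothetical sketch of up-front flag validation; not the plugin's code.
# Prints the selected mode on stdout, or returns 2 with a message on stderr.
parse_mode() {
  mode=apply
  mode_flags=0
  modifiers=""
  for arg in "$@"; do
    case "$arg" in
      --diff=true)   mode=diff;        mode_flags=$((mode_flags + 1)) ;;
      --checks-only) mode=checks-only; mode_flags=$((mode_flags + 1)) ;;
      --print-deps)  mode=print-deps;  mode_flags=$((mode_flags + 1)) ;;
      --dry-run=server|--dry-run=none|--skip-checks)
        modifiers="$modifiers $arg" ;;
      --dry-run=*)
        echo "unsupported value: $arg (use server or none)" >&2
        return 2 ;;
    esac
  done
  if [ "$mode_flags" -gt 1 ]; then
    echo "modes are mutually exclusive" >&2
    return 2
  fi
  if [ "$mode" != apply ] && [ -n "$modifiers" ]; then
    echo "only valid in apply mode:$modifiers" >&2
    return 2
  fi
  echo "$mode"
}
```

Validating before any kubectl call means a conflicting combination such as `--diff=true --checks-only` fails without touching the cluster.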

Inlined y-yconverge-deps into kubectl-yconverge as _find_cue_dir /
_find_imports / _resolve_deps; _VISITED moved from a tempfile to a
shell variable. The duplicate indirection logic is gone — a single
source of truth for "find yconverge.cue via one-level single-directory
indirection, ignoring file resources like a sibling PDB."
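
Keeping the visited set in a shell variable can be sketched as a depth-first walk. Everything here is illustrative: `deps_of` stands in for reading imports from yconverge.cue and returns a hard-coded example graph, and the names `VISITED`, `ORDER` and `resolve_deps` are not the plugin's own.

```shell
#!/bin/sh
# Hypothetical sketch: dependency ordering with the visited set in a variable.
# deps_of is a stand-in for "read the imports of this step's yconverge.cue".
deps_of() {
  case "$1" in
    app)   echo "db kafka" ;;
    kafka) echo "blobs" ;;
    *)     echo "" ;;
  esac
}

VISITED=" "   # space-delimited, so " name " matches whole names only
ORDER=""

resolve_deps() {
  case "$VISITED" in
    *" $1 "*) return 0 ;;   # already handled, skip
  esac
  VISITED="$VISITED$1 "
  # The word list is expanded once, so recursion does not disturb the loop
  for dep in $(deps_of "$1"); do
    resolve_deps "$dep"
  done
  # Dependencies are appended before the step that needs them
  ORDER="$ORDER${ORDER:+ }$1"
}

resolve_deps app
echo "$ORDER"   # prints: db blobs kafka app
```

Marking a step as visited before recursing also makes the walk terminate if the graph ever contained a cycle, although the resulting order would then not be a valid topological order.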

replace-mode dry-run: delete step now forwards --dry-run=$DRY_RUN so
kubectl itself simulates and prints "(server dry run)" without
mutating. Previously the delete was skipped entirely under dry-run,
hiding intent and leaving "does dry-run delete anything?" untestable.

New CUE examples demonstrating shared checks:
  example-db/checks/checks.cue   — #DbChecks: { replicas, list } template
  example-db/single/yconverge.cue      — imports, unifies replicas: 1,
                                         adds an exec check asserting
                                         no PDB requires >1 replica
  example-db/distributed/yconverge.cue — imports, unifies replicas: 3

cluster-prod/db gains an ad-hoc pdb.yaml (minAvailable: 2) alongside
the distributed base — exercises the "ignore file resources when
counting directory resources for indirection" rule and also gives the
single-replica check something to fail against when running prod→qa
without the recovery delete.

Tests added in yconverge/itest/test.sh:
  - example-replace/ with a converge-mode=replace Job, verifying
    --dry-run=server prints "(server dry run)" and preserves the Job
    metadata.uid (i.e. delete is provably non-mutating).

Framework correctness fixes uncovered while wiring the above:
  - exec-kind check retry loop now honors the schema's timeout field
    and returns non-zero on final failure (previously it retried 15×2s
    and silently claimed success).
  - NS_GUESS is exported so exec check commands can target the
    resolved namespace without hard-coding it.
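
A retry loop that honors a timeout and reports final failure could look like this. It is a sketch under assumptions: the name `retry_until_timeout` is hypothetical, the timeout is taken as whole seconds, the interval must be greater than zero, and only sleep time is counted toward the deadline.

```shell
#!/bin/sh
# Hypothetical sketch: retry a command until it succeeds or a deadline passes.
# $1: timeout in seconds, $2: seconds between attempts (> 0), rest: command
retry_until_timeout() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "check failed after ${elapsed}s: $*" >&2
      return 1   # final failure is reported, never claimed as success
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}
```

The explicit `return 1` on the final attempt is the part that matters for the fix described above: the caller sees a non-zero exit when the check never passed.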

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The workflow runs both script linting and yconverge integration tests,
so "lint" was misleading. Renames the file and updates the
workflow_call reference in images.yaml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator Author

solsson commented Apr 22, 2026

Superseded by #76
