@evan-cz evan-cz commented Dec 18, 2025

While adding comprehensive per-type unit tests for defaults.* properties, I discovered that defaults.image.pullSecrets was not applied to most workload templates. Only config-loader-job.yaml and helmless-job.yaml used the correct generateImagePullSecrets helper; six other templates used legacy helpers that did not fall back to defaults.image.pullSecrets.

Functional Change:

Before: Setting defaults.image.pullSecrets had no effect on agent-deploy, aggregator-deploy, agent-daemonset, webhook-deploy, backfill-job, or init-cert-job templates. Users had to set the deprecated top-level imagePullSecrets or configure each component individually.

After: Setting defaults.image.pullSecrets applies to all workload templates. Backwards compatibility with the deprecated imagePullSecrets is preserved.

Root Cause:

Six templates were using legacy imagePullSecrets helpers that had different fallback chains:

  1. cloudzero-agent.server.imagePullSecrets - only checked .Values.imagePullSecrets
  2. cloudzero-agent.insightsController.server.imagePullSecrets - checked insightsController.server.imagePullSecrets -> imagePullSecrets
  3. cloudzero-agent.initBackfillJob.imagePullSecrets - checked backFillValues.imagePullSecrets -> insightsController.server.imagePullSecrets -> imagePullSecrets
  4. cloudzero-agent.initCertJob.imagePullSecrets - similar chain

None of these helpers included defaults.image.pullSecrets in their fallback chain.
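
To make the gap concrete, here is a minimal Python model of the first three legacy chains listed above. It is only a sketch: the `values` dict stands in for Helm's `.Values`, the separate `backfill_values` argument stands in for `backFillValues`, and the function names are hypothetical. The fourth helper is left out because its exact chain is not spelled out here.

```python
# Hypothetical model of three legacy fallback chains; not the chart's actual code.

def dig(mapping, *keys):
    """Walk nested dicts, returning None as soon as a key is missing."""
    node = mapping
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def first_set(*candidates):
    """Return the first non-empty candidate, or [] if none is set."""
    for candidate in candidates:
        if candidate:
            return candidate
    return []

def legacy_server(values):
    # Chain 1: only the deprecated top-level key is consulted.
    return first_set(dig(values, "imagePullSecrets"))

def legacy_insights_controller(values):
    # Chain 2: component-level key, then the deprecated top-level key.
    return first_set(
        dig(values, "insightsController", "server", "imagePullSecrets"),
        dig(values, "imagePullSecrets"),
    )

def legacy_backfill(values, backfill_values):
    # Chain 3: job-level key, then chain 2.
    return first_set(
        dig(backfill_values, "imagePullSecrets"),
        legacy_insights_controller(values),
    )

# Only defaults.image.pullSecrets is set, so every legacy chain comes back empty.
values = {"defaults": {"image": {"pullSecrets": [{"name": "regcred"}]}}}
print(legacy_server(values))               # []
print(legacy_insights_controller(values))  # []
print(legacy_backfill(values, {}))         # []
```

None of the three functions ever reads `defaults.image.pullSecrets`, which matches the behavior reported before this change.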

Solution:

  1. Updated generateImagePullSecrets helper (_helpers.tpl:1137-1142) to include deprecated imagePullSecrets as final fallback: .image.pullSecrets | default .root.Values.defaults.image.pullSecrets | default .root.Values.imagePullSecrets

  2. Updated six templates to use generateImagePullSecrets:

    • agent-deploy.yaml (line 310)
    • aggregator-deploy.yaml (line 159)
    • agent-daemonset.yaml (line 190)
    • webhook-deploy.yaml (line 86)
    • backfill-job.yaml (line 122)
    • init-cert-job.yaml (line 54)
  3. Also switched aggregator-service.yaml to generateLabels for consistency with other templates. This removes the app.kubernetes.io/component label, which was applied inconsistently across resources.
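
The fallback order from item 1 can be sketched in Python as follows. This is an illustration rather than the helper itself: `sprig_default` imitates the `default` template function (which falls through on nil or empty values), and the function signature and dict layout are assumptions made for the example.

```python
# Hypothetical Python model of the updated fallback order; not the template source.

def sprig_default(fallback, value):
    """Imitate `value | default fallback`: keep value unless it is empty."""
    return value if value else fallback

def generate_image_pull_secrets(image, root_values):
    """Resolve pull secrets: component, then defaults, then deprecated key."""
    component = (image or {}).get("pullSecrets")
    chart_default = root_values.get("defaults", {}).get("image", {}).get("pullSecrets")
    deprecated = root_values.get("imagePullSecrets")
    # .image.pullSecrets | default <defaults.image.pullSecrets> | default <imagePullSecrets>
    resolved = sprig_default(deprecated, sprig_default(chart_default, component))
    return resolved or []

root = {
    "defaults": {"image": {"pullSecrets": [{"name": "default-cred"}]}},
    "imagePullSecrets": [{"name": "legacy-cred"}],
}

# A component-level setting wins over everything else.
print(generate_image_pull_secrets({"pullSecrets": [{"name": "component-cred"}]}, root))
# [{'name': 'component-cred'}]

# Without a component setting, defaults.image.pullSecrets applies.
print(generate_image_pull_secrets({}, root))
# [{'name': 'default-cred'}]

# With only the deprecated key set, it still works as the final fallback.
print(generate_image_pull_secrets({}, {"imagePullSecrets": [{"name": "legacy-cred"}]}))
# [{'name': 'legacy-cred'}]
```

Putting the deprecated key last means existing installations that only set `imagePullSecrets` keep rendering the same secrets, while any newer setting takes precedence.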

Validation:

  • All 395 Helm unit tests pass (8 new tests added for defaults.image.pullSecrets)
  • New tests verify defaults.image.pullSecrets applies to Deployment, DaemonSet, Job, and CronJob resources
  • New tests verify backwards compatibility with deprecated imagePullSecrets
  • Manual helm template verification confirms imagePullSecrets renders correctly when defaults.image.pullSecrets is set

@evan-cz evan-cz requested a review from a team as a code owner December 18, 2025 19:15
@evan-cz evan-cz added this pull request to the merge queue Jan 5, 2026
Merged via the queue into develop with commit 7b89f8c Jan 5, 2026
44 checks passed
@evan-cz evan-cz deleted the CP-35716 branch January 5, 2026 18:49
evan-cz added a commit that referenced this pull request Jan 6, 2026
PR #606 code review identified that documentation contains outdated Kubernetes
label selectors. The label system was refactored in CP-35429 (commit c7a0b6f)
to follow Kubernetes recommended labels best practices, but documentation was
not updated to reflect the new selector patterns.

Functional Change:

Before: Documentation examples used `app.kubernetes.io/component=server` combined
with `app.kubernetes.io/name=cloudzero-agent` to query agent resources, which no
longer matches the actual labels on deployed resources.

After: Documentation examples use `app.kubernetes.io/part-of=cloudzero-agent`
combined with `app.kubernetes.io/name=server` (or aggregator, webhook-server, etc.),
which correctly matches the current label schema.

Solution:

1. Updated helm/docs/troubleshooting-guide.md (~50 instances) - main troubleshooting
   documentation with extensive kubectl examples
2. Updated helm/docs/deploy-validation.md (6 instances) - deployment validation guide
3. Updated helm/docs/cert-trouble-shooting.md - example output showing new labels
4. Updated helm/docs/upgrades.md - job deletion command selector
5. Updated docs/wiki/Debugging-Guide.md (5 instances) - debugging procedures
6. Updated docs/wiki/Installation-FAQ.md (7 instances) - FAQ kubectl examples
7. Updated docs/wiki/CloudZero-Agent-Replicas-and-Resources.md (2 instances)
8. Updated app/domain/transform/dcgm/CLAUDE.md and README.md - aggregator examples
9. Updated app/functions/helmless/README.md - helmless job log retrieval
10. Deleted obsolete docs/wiki/Webhook-Server.md.bak backup file

Label transformation pattern applied throughout:
- `app.kubernetes.io/component=server,app.kubernetes.io/name=cloudzero-agent`
  became `app.kubernetes.io/part-of=cloudzero-agent,app.kubernetes.io/name=server`
- kube-state-metrics uses `app.kubernetes.io/name=kube-state-metrics` (sub-chart,
  no part-of label)
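
As an illustration of the transformation pattern above, the following Python sketch rewrites an old-style selector string into the new form. It assumes selectors contain only comma-separated `key=value` terms, and that the old component value carries over unchanged as the new name value (shown here only for `server`). The function name is hypothetical.

```python
# Hypothetical helper for illustration; it is not part of the repository.

COMPONENT = "app.kubernetes.io/component"
NAME = "app.kubernetes.io/name"
PART_OF = "app.kubernetes.io/part-of"

def migrate_selector(selector):
    """Rewrite an old component/name selector into the part-of/name form."""
    # Assumes every term is a plain equality term of the form key=value.
    pairs = dict(term.split("=", 1) for term in selector.split(","))
    if pairs.get(NAME) == "cloudzero-agent" and COMPONENT in pairs:
        component = pairs.pop(COMPONENT)
        pairs.pop(NAME)
        migrated = {PART_OF: "cloudzero-agent", NAME: component}
        migrated.update(pairs)  # keep any unrelated terms after the new ones
        pairs = migrated
    return ",".join(f"{key}={value}" for key, value in pairs.items())

old = "app.kubernetes.io/component=server,app.kubernetes.io/name=cloudzero-agent"
print(migrate_selector(old))
# app.kubernetes.io/part-of=cloudzero-agent,app.kubernetes.io/name=server

# Sub-chart selectors without the old pattern pass through untouched.
print(migrate_selector("app.kubernetes.io/name=kube-state-metrics"))
# app.kubernetes.io/name=kube-state-metrics
```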

Validation:

- Verified all documentation files with grep for `app.kubernetes.io/component`
- Confirmed remaining occurrences are intentional (label export config, test fixtures,
  _helpers.tpl source code)
- No functional changes to code - documentation only
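
The grep check described above could be approximated with a small Python scan like the one below. This is a sketch under assumptions: files are passed as an in-memory mapping of path to contents to keep the example self-contained, the allow-list is matched by path suffix, and the example paths (other than the `_helpers.tpl` file name) are made up for illustration.

```python
# Hypothetical scan for leftover old-style labels; not the check actually run.

STALE_LABEL = "app.kubernetes.io/component"

def find_stale_selectors(files, allowed_suffixes=()):
    """Return (path, line_number, line) for each occurrence outside the allow-list."""
    hits = []
    for path, text in files.items():
        # Skip files where the label is known to be intentional.
        if any(path.endswith(suffix) for suffix in allowed_suffixes):
            continue
        for number, line in enumerate(text.splitlines(), start=1):
            if STALE_LABEL in line:
                hits.append((path, number, line.strip()))
    return hits

files = {
    "docs/updated.md": "kubectl get pods -l app.kubernetes.io/part-of=cloudzero-agent\n",
    "docs/missed.md": "Intro\nkubectl get pods -l app.kubernetes.io/component=server\n",
    "templates/_helpers.tpl": "app.kubernetes.io/component: example\n",
}

for hit in find_stale_selectors(files, allowed_suffixes=("_helpers.tpl",)):
    print(hit)
# ('docs/missed.md', 2, 'kubectl get pods -l app.kubernetes.io/component=server')
```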
github-merge-queue bot pushed a commit that referenced this pull request Jan 6, 2026
* CP-36332: Update docs to use new Kubernetes label selectors

* Standardize namespace to cloudzero-agent in documentation

Documentation used inconsistent namespace names (`cza`, `cz-agent`, `cz-webhook-test`,
`cloudzero`) in kubectl command examples. This creates confusion for users following
the documentation and may cause commands to fail if users copy them directly.

Functional Change:

Before: Documentation examples used various short namespace names like `-n cza`,
`-n cz-agent`, `-n cz-webhook-test`, or `-n cloudzero` inconsistently across files.

After: All documentation examples consistently use `-n cloudzero-agent`, which is
the default namespace for the CloudZero Agent Helm chart installation.

Solution:

1. Updated helm/DEVELOPMENT.md - HPA verification commands (2 instances)
2. Updated app/functions/helmless/README.md - ConfigMap extraction commands (2 instances)
3. Updated app/functions/certifik8s/README.md - ServiceAccount YAML examples (2 instances)
4. Updated tests/load/README.md - Load testing kubectl commands (8 instances)
5. Updated tests/testkube/README.md - TestKube debug commands (2 instances)
6. Updated tests/webhook/README.md - Webhook metrics port-forward (1 instance)

Namespace transformation pattern applied:
- `cza` -> `cloudzero-agent`
- `cz-agent` -> `cloudzero-agent`
- `cz-webhook-test` -> `cloudzero-agent`
- `cloudzero` -> `cloudzero-agent`
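
A rough Python sketch of this namespace rewrite is shown below. It assumes the namespace always appears in the `-n <namespace>` flag form, so namespaces written in YAML examples would need separate handling, and the function name is hypothetical. The lookahead keeps the pattern from touching `-n cloudzero-agent` or unrelated namespaces.

```python
import re

# Hypothetical rewrite helper for illustration; covers only the `-n <ns>` flag form.
OLD_NAMESPACES = ("cz-webhook-test", "cz-agent", "cza", "cloudzero")

_NAMESPACE_FLAG = re.compile(
    r"(?<![\w-])-n (?:" + "|".join(re.escape(ns) for ns in OLD_NAMESPACES) + r")(?![\w-])"
)

def standardize_namespace(line):
    """Replace any old short namespace in a `-n` flag with cloudzero-agent."""
    return _NAMESPACE_FLAG.sub("-n cloudzero-agent", line)

print(standardize_namespace("kubectl get pods -n cza"))
# kubectl get pods -n cloudzero-agent

print(standardize_namespace("kubectl logs -n cz-agent deploy/example"))
# kubectl logs -n cloudzero-agent deploy/example

# Already-correct and external namespaces are left alone.
print(standardize_namespace("kubectl get pods -n cloudzero-agent"))
# kubectl get pods -n cloudzero-agent
print(standardize_namespace("kubectl get pods -n kube-system"))
# kubectl get pods -n kube-system
```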

Validation:

- Verified with grep that no remaining instances of `-n cza`, `-n cz-agent`,
  `-n cz-webhook-test`, or `-n cloudzero` (without -agent) exist in markdown files
- External component namespaces intentionally preserved (cert-manager, kube-system,
  istio-system, monitoring, etc.)
- No functional changes to code - documentation only