CP-36280: Add command failure tracking to anaximander diagnostic script #619

evan-cz · 2026-01-05T18:43:16Z

CP-36280: Add command failure tracking to anaximander diagnostic script

A recent customer incident revealed that when anaximander diagnostic commands fail,
the output files are simply missing, making it difficult to notice during support
triage. Specifically, cAdvisor metrics were not being collected for a customer, and
the absence of data wasn't caught because there was no indication of failure.

Additionally, the script was using outdated label selectors that no longer match
the helm chart's consistent labeling scheme introduced in a recent release.

Functional Change:

Before: Diagnostic command failures resulted in missing output files with no
indication of what failed. The script used app.kubernetes.io/component=server
label selector which no longer exists in the helm chart.

After: All diagnostic command results are tracked in command-results.txt showing
[SUCCESS] or [FAILED] status. Failed commands have their output files prefixed with
failure details. A warning is displayed at the end if any commands failed. Label
selector updated to use the current helm chart labeling scheme.

Solution:

Added run_diagnostic() wrapper function that:
- Logs command success/failure to command-results.txt
- Prepends [FAILED] notice with exit code to output files on failure
- Tracks failure count for final summary
Converted 9 diagnostic commands to use the wrapper:
- helm list
- kubectl get all
- kubectl get secrets
- kubectl describe all
- kubectl get events
- kubectl get configmaps
- kubectl get networkpolicies
- kubectl top pods
- kubectl get --raw (cAdvisor metrics)
Added manual tracking for complex multi-command sections:
- Calculate secret sizes
- Gather pod logs (tracks per-container failures)
- Gather job logs (tracks per-job failures)
- Detect service mesh configuration
- Gather scrape configuration
Fixed server pod label selector from app.kubernetes.io/component=server to
app.kubernetes.io/part-of=cloudzero-agent,app.kubernetes.io/name=server
Added final summary displaying warning when commands failed

Validation:

Tested on multiple deployed clusters
Tested failure scenario by running against namespace without CloudZero Agent -
script completes gracefully, "No server pod found" message displayed
Verified command-results.txt output format shows clear success/failure status
Verified cadvisor-metrics.txt preserves header information with metrics data

A recent customer incident revealed that when anaximander diagnostic commands fail, the output files are simply missing, making it difficult to notice during support triage. Specifically, cAdvisor metrics were not being collected for a customer, and the absence of data wasn't caught because there was no indication of failure. Additionally, the script was using outdated label selectors that no longer match the helm chart's consistent labeling scheme introduced in a recent release. Functional Change: Before: Diagnostic command failures resulted in missing output files with no indication of what failed. The script used `app.kubernetes.io/component=server` label selector which no longer exists in the helm chart. After: All diagnostic command results are tracked in `command-results.txt` showing [SUCCESS] or [FAILED] status. Failed commands have their output files prefixed with failure details. A warning is displayed at the end if any commands failed. Label selector updated to use the current helm chart labeling scheme. Solution: 1. Added `run_diagnostic()` wrapper function that: - Logs command success/failure to `command-results.txt` - Prepends `[FAILED]` notice with exit code to output files on failure - Tracks failure count for final summary 2. Converted 9 diagnostic commands to use the wrapper: - helm list - kubectl get all - kubectl get secrets - kubectl describe all - kubectl get events - kubectl get configmaps - kubectl get networkpolicies - kubectl top pods - kubectl get --raw (cAdvisor metrics) 3. Added manual tracking for complex multi-command sections: - Calculate secret sizes - Gather pod logs (tracks per-container failures) - Gather job logs (tracks per-job failures) - Detect service mesh configuration - Gather scrape configuration 4. Fixed server pod label selector from `app.kubernetes.io/component=server` to `app.kubernetes.io/part-of=cloudzero-agent,app.kubernetes.io/name=server` 5. Added final summary displaying warning when commands failed Validation: - Tested on multiple deployed clusters - Tested failure scenario by running against namespace without CloudZero Agent - script completes gracefully, "No server pod found" message displayed - Verified command-results.txt output format shows clear success/failure status - Verified cadvisor-metrics.txt preserves header information with metrics data

evan-cz requested a review from a team as a code owner January 5, 2026 18:43

evan-cz marked this pull request as draft January 5, 2026 18:45

evan-cz force-pushed the CP-36280 branch from 8669f37 to f469cf4 Compare January 5, 2026 18:59

evan-cz marked this pull request as ready for review January 5, 2026 19:00

dmepham approved these changes Jan 5, 2026

View reviewed changes

evan-cz added this pull request to the merge queue Jan 5, 2026

Merged via the queue into develop with commit 213bda4 Jan 5, 2026
44 checks passed

evan-cz deleted the CP-36280 branch January 5, 2026 20:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CP-36280: Add command failure tracking to anaximander diagnostic script #619

CP-36280: Add command failure tracking to anaximander diagnostic script #619

Uh oh!

evan-cz commented Jan 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CP-36280: Add command failure tracking to anaximander diagnostic script #619

CP-36280: Add command failure tracking to anaximander diagnostic script #619

Uh oh!

Conversation

evan-cz commented Jan 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

evan-cz commented Jan 5, 2026 •

edited

Loading