Skip to content

Conversation

@evan-cz
Copy link
Contributor

@evan-cz evan-cz commented Jan 5, 2026

CP-36280: Add command failure tracking to anaximander diagnostic script

A recent customer incident revealed that when anaximander diagnostic commands fail,
the output files are simply missing, making it difficult to notice during support
triage. Specifically, cAdvisor metrics were not being collected for a customer, and
the absence of data wasn't caught because there was no indication of failure.

Additionally, the script was using outdated label selectors that no longer match
the helm chart's consistent labeling scheme introduced in a recent release.

Functional Change:

Before: Diagnostic command failures resulted in missing output files with no
indication of what failed. The script used app.kubernetes.io/component=server
label selector which no longer exists in the helm chart.

After: All diagnostic command results are tracked in command-results.txt showing
[SUCCESS] or [FAILED] status. Failed commands have their output files prefixed with
failure details. A warning is displayed at the end if any commands failed. Label
selector updated to use the current helm chart labeling scheme.

Solution:

  1. Added run_diagnostic() wrapper function that:

    • Logs command success/failure to command-results.txt
    • Prepends [FAILED] notice with exit code to output files on failure
    • Tracks failure count for final summary
  2. Converted 9 diagnostic commands to use the wrapper:

    • helm list
    • kubectl get all
    • kubectl get secrets
    • kubectl describe all
    • kubectl get events
    • kubectl get configmaps
    • kubectl get networkpolicies
    • kubectl top pods
    • kubectl get --raw (cAdvisor metrics)
  3. Added manual tracking for complex multi-command sections:

    • Calculate secret sizes
    • Gather pod logs (tracks per-container failures)
    • Gather job logs (tracks per-job failures)
    • Detect service mesh configuration
    • Gather scrape configuration
  4. Fixed server pod label selector from app.kubernetes.io/component=server to
    app.kubernetes.io/part-of=cloudzero-agent,app.kubernetes.io/name=server

  5. Added final summary displaying warning when commands failed

Validation:

  • Tested on multiple deployed clusters
  • Tested failure scenario by running against namespace without CloudZero Agent -
    script completes gracefully, "No server pod found" message displayed
  • Verified command-results.txt output format shows clear success/failure status
  • Verified cadvisor-metrics.txt preserves header information with metrics data

@evan-cz evan-cz requested a review from a team as a code owner January 5, 2026 18:43
@evan-cz evan-cz marked this pull request as draft January 5, 2026 18:45
A recent customer incident revealed that when anaximander diagnostic commands fail,
the output files are simply missing, making it difficult to notice during support
triage. Specifically, cAdvisor metrics were not being collected for a customer, and
the absence of data wasn't caught because there was no indication of failure.

Additionally, the script was using outdated label selectors that no longer match
the helm chart's consistent labeling scheme introduced in a recent release.

Functional Change:

Before: Diagnostic command failures resulted in missing output files with no
indication of what failed. The script used `app.kubernetes.io/component=server`
label selector which no longer exists in the helm chart.

After: All diagnostic command results are tracked in `command-results.txt` showing
[SUCCESS] or [FAILED] status. Failed commands have their output files prefixed with
failure details. A warning is displayed at the end if any commands failed. Label
selector updated to use the current helm chart labeling scheme.

Solution:

1. Added `run_diagnostic()` wrapper function that:
   - Logs command success/failure to `command-results.txt`
   - Prepends `[FAILED]` notice with exit code to output files on failure
   - Tracks failure count for final summary

2. Converted 9 diagnostic commands to use the wrapper:
   - helm list
   - kubectl get all
   - kubectl get secrets
   - kubectl describe all
   - kubectl get events
   - kubectl get configmaps
   - kubectl get networkpolicies
   - kubectl top pods
   - kubectl get --raw (cAdvisor metrics)

3. Added manual tracking for complex multi-command sections:
   - Calculate secret sizes
   - Gather pod logs (tracks per-container failures)
   - Gather job logs (tracks per-job failures)
   - Detect service mesh configuration
   - Gather scrape configuration

4. Fixed server pod label selector from `app.kubernetes.io/component=server` to
   `app.kubernetes.io/part-of=cloudzero-agent,app.kubernetes.io/name=server`

5. Added final summary displaying warning when commands failed

Validation:

- Tested on multiple deployed clusters
- Tested failure scenario by running against namespace without CloudZero Agent -
  script completes gracefully, "No server pod found" message displayed
- Verified command-results.txt output format shows clear success/failure status
- Verified cadvisor-metrics.txt preserves header information with metrics data
@evan-cz evan-cz marked this pull request as ready for review January 5, 2026 19:00
@evan-cz evan-cz added this pull request to the merge queue Jan 5, 2026
Merged via the queue into develop with commit 213bda4 Jan 5, 2026
44 checks passed
@evan-cz evan-cz deleted the CP-36280 branch January 5, 2026 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants