CP-36280: Add command failure tracking to anaximander diagnostic script #619
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
CP-36280: Add command failure tracking to anaximander diagnostic script
A recent customer incident revealed that when anaximander diagnostic commands fail,
the output files are simply missing, making it difficult to notice during support
triage. Specifically, cAdvisor metrics were not being collected for a customer, and
the absence of data wasn't caught because there was no indication of failure.
Additionally, the script was using outdated label selectors that no longer match
the helm chart's consistent labeling scheme introduced in a recent release.
Functional Change:
Before: Diagnostic command failures resulted in missing output files with no
indication of what failed. The script used
app.kubernetes.io/component=serverlabel selector which no longer exists in the helm chart.
After: All diagnostic command results are tracked in
command-results.txtshowing[SUCCESS] or [FAILED] status. Failed commands have their output files prefixed with
failure details. A warning is displayed at the end if any commands failed. Label
selector updated to use the current helm chart labeling scheme.
Solution:
Added
run_diagnostic()wrapper function that:command-results.txt[FAILED]notice with exit code to output files on failureConverted 9 diagnostic commands to use the wrapper:
Added manual tracking for complex multi-command sections:
Fixed server pod label selector from
app.kubernetes.io/component=servertoapp.kubernetes.io/part-of=cloudzero-agent,app.kubernetes.io/name=serverAdded final summary displaying warning when commands failed
Validation:
script completes gracefully, "No server pod found" message displayed