Add diagnostic hint catalog for well-known executor failure modes #4890
dejanzele wants to merge 1 commit into armadaproject:master
Greptile Summary
This PR introduces The implementation is correct:
Confidence Score: 5/5
Safe to merge: hint injection is additive, paths are mutually exclusive, and the retryability preface is correctly preserved.
No P0 or P1 findings. All three injection sites are correct and non-overlapping. The previously flagged retryable-preface issue is resolved in this version. Test coverage spans all four retryable/non-retryable x hint/no-hint combinations. No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Pod State Change Detected] --> B{Pod Phase?}
B -->|PodFailed| C[job_state_reporter.go\nCreateEventForCurrentState]
C --> D[ExtractPodFailedReason]
D --> E[diagnostics.LookupHint]
E -->|hint found| F[reason = hint + reason]
E -->|no hint| G[reason unchanged]
F --> H[CreateJobFailedEvent]
G --> H
H --> I{DetectAndRegisterFailedPodIssue?}
I -->|issueAdded=true retryable| J[Store podIssue.Message\nhint + original\nDiscard event]
I -->|issueAdded=false non-retryable| K[Queue event with hint]
J --> L[handleRetryableJobIssue\nReturn lease / retry]
B -->|PodPending/Unknown| M[detectPodIssues\npendingPodChecker.GetAction]
M -->|action != Wait| N[createStuckPodMessage\nretryable bool + originalMessage]
N --> O[diagnostics.LookupHint]
O -->|hint found| P[preface + hint + original]
O -->|no hint| Q[preface + original]
P --> R[Store podIssue.Message]
Q --> R
R --> S[handleNonRetryableJobIssue\nCreateJobFailedEvent with stored message]
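The two message-building branches in the flowchart (reason = hint + reason for failed pods, preface + hint + original for stuck pods) can be sketched as below. This is a minimal illustration, not the PR's code: the function names, the preface wording, and the newline separator are all assumptions, and the hint is passed in as a plain string (empty meaning "no hint") rather than looked up.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder prefaces. The real wording used by the executor is not shown
// in this PR text, so these strings are assumptions.
const (
	retryablePreface    = "Pod is stuck; the job will be retried."
	nonRetryablePreface = "Pod is stuck; the job will not be retried."
)

// prependHint mirrors the PodFailed branch: hint + reason when a hint exists,
// otherwise the reason unchanged. The newline separator is an assumption.
func prependHint(hint, reason string) string {
	if hint == "" {
		return reason
	}
	return hint + "\n" + reason
}

// composeStuckPodMessage mirrors the PodPending/Unknown branch: the
// retryability preface always comes first, then the optional hint, then the
// original message preserved verbatim.
func composeStuckPodMessage(retryable bool, hint, original string) string {
	preface := nonRetryablePreface
	if retryable {
		preface = retryablePreface
	}
	parts := []string{preface}
	if hint != "" {
		parts = append(parts, hint)
	}
	parts = append(parts, original)
	return strings.Join(parts, "\n")
}

func main() {
	fmt.Println(prependHint("Hint: check the image architecture.", "raw kubelet text"))
	fmt.Println(composeStuckPodMessage(true, "", "raw kubelet text"))
}
```

Keeping the preface outside the hint logic is what covers the four retryable/non-retryable x hint/no-hint combinations without special cases.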
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Summary
When a job fails for a well-known reason (e.g. wrong-arch image), the user-facing error in Lookout is the raw kubelet text, often opaque. This PR adds a diagnostics package that maps such errors to actionable hints; the hint is prepended to the failure message at the three executor sites that build user-facing text (reporter/event.go PodFailed branch, service/pod_issue_handler.go retryable failed-pod path, and service/pod_issue_handler.go createStuckPodMessage).
The package is separate from the categorizer because the two are orthogonal: diagnostic hints are always-on and curated, categorization is opt-in and operator-configured. Splitting them avoids forcing a categorizer dependency on hint consumers and lets either side evolve without churning the other.
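As a rough sketch of what a compiled-in catalog of this kind could look like: the struct fields, the lowercase lookupHint name, and both example patterns and hint texts below are illustrative assumptions, not the entries or API shipped in this PR. Only the catalog name builtinHints comes from the description.

```go
package main

import (
	"fmt"
	"regexp"
)

// hint pairs a pattern with curated text. Field names are assumptions.
type hint struct {
	pattern *regexp.Regexp
	text    string
}

// builtinHints is built once at package initialisation, so an invalid
// pattern makes MustCompile panic at startup rather than at lookup time.
// Both entries are illustrative examples only.
var builtinHints = []hint{
	{
		pattern: regexp.MustCompile(`(?i)exec format error`),
		text:    "The image may have been built for a different CPU architecture than the node.",
	},
	{
		pattern: regexp.MustCompile(`(?i)no matching manifest for`),
		text:    "The image does not appear to publish a variant for the node's platform.",
	},
}

// lookupHint scans the catalog in order and returns the first matching
// entry's text, so earlier entries take precedence over later ones.
func lookupHint(message string) (string, bool) {
	for _, h := range builtinHints {
		if h.pattern.MatchString(message) {
			return h.text, true
		}
	}
	return "", false
}

func main() {
	if text, ok := lookupHint("exec /bin/app: exec format error"); ok {
		fmt.Println(text)
	}
}
```

Because lookup is ordered, more specific patterns would need to sit above more general ones in the slice.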
Adding a hint is a single {pattern, text} entry in builtinHints. Patterns compile at startup; first match wins. A future PR may move the catalog from compiled-in to config-driven once we see whether operators want deployment-specific entries.
Validation
To reproduce on local dev:
1. Submit a wrong-arch job (example/platform-mismatch.yaml):
2. Wait for the kubelet event-based fail check to fire (typically 1-5 minutes depending on event_checks grace period).
3. Verify the hint appears in the lookout DB. The error column is zlib-compressed bytea, so decode in two steps:
Expected (and observed) result
Decompressed error column begins with the curated hint, followed by the raw kubelet error preserved verbatim:
Live-validated end-to-end on macOS arm64 (M3) against a k3d cluster.
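For readers who prefer to decode the error column programmatically, the two steps from the verification item (undo the hex rendering of the bytea value, then zlib-decompress it) might look like the sketch below. It assumes the value was copied in PostgreSQL's default hex bytea output format (a leading \x followed by hex digits). The function names are hypothetical, and encodeForDemo exists only to make the example self-contained.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// decodeErrorColumn takes a bytea value in hex text form (with or without
// the leading \x) and returns the zlib-decompressed text.
func decodeErrorColumn(hexValue string) (string, error) {
	// Step 1: hex text -> raw compressed bytes.
	raw, err := hex.DecodeString(strings.TrimPrefix(hexValue, `\x`))
	if err != nil {
		return "", fmt.Errorf("hex decode: %w", err)
	}
	// Step 2: zlib-decompress the raw bytes.
	r, err := zlib.NewReader(bytes.NewReader(raw))
	if err != nil {
		return "", fmt.Errorf("zlib header: %w", err)
	}
	defer r.Close()
	out, err := io.ReadAll(r)
	if err != nil {
		return "", fmt.Errorf("zlib read: %w", err)
	}
	return string(out), nil
}

// encodeForDemo produces a sample value in the same shape so the sketch can
// run without a database. It is not part of the validation procedure.
func encodeForDemo(s string) (string, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write([]byte(s)); err != nil {
		return "", err
	}
	if err := w.Close(); err != nil {
		return "", err
	}
	return `\x` + hex.EncodeToString(buf.Bytes()), nil
}

func main() {
	sample, err := encodeForDemo("curated hint\nraw kubelet error")
	if err != nil {
		panic(err)
	}
	decoded, err := decodeErrorColumn(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded)
}
```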