Add onPodError matcher to categorize pre-startup pod failures #4891
dejanzele wants to merge 1 commit into armadaproject:master
Greptile Summary
This PR adds
Confidence Score: 5/5
Safe to merge; only P2 style findings, no logic or correctness issues.
All changes are additive and opt-in (no default rules shipped). The rename from
internal/executor/categorizer/classifier_test.go: test dispatch heuristic leaves
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Pod event detected] --> B{Pod Phase}
B -- PodFailed --> C[job_state_reporter.go\nreportCurrentStatus]
C --> D[ClassifyContainerError\npod state only]
D --> E[Match: onConditions\nonExitCodes\nonTerminationMessage]
B -- Pending / Stuck / Deadline --> F[pod_issue_handler.go\ndetectPodIssues]
F --> G{Retryable?}
G -- Yes --> H[handleRetryableJobIssue\nReturn lease, no classification]
G -- No --> I[handleNonRetryableJobIssue]
I --> J[ClassifyPodError\npod state + podIssue.Message]
J --> K[Match: onConditions\nonExitCodes\nonTerminationMessage\nonPodError NEW]
K --> L[ClassifyResult\nCategory + Subcategory]
E --> L
L --> M[CreateJobFailedEvent\nfailure_category / failure_subcategory]
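The routing in the flowchart can be sketched in Go as follows. This is a minimal illustration, not the executor's code: podEvent and matchersFor are hypothetical names introduced here, and only the matcher names and the three branches (failed pod, retryable issue, non-retryable issue) come from the chart.

```go
package main

import "fmt"

// podEvent is a hypothetical summary of what the executor observed for a pod.
// It is not an Armada type.
type podEvent struct {
	Failed       bool   // pod reached the PodFailed phase
	Retryable    bool   // only meaningful for detected pod issues
	IssueMessage string // pod-level error text captured with the issue
}

// matchersFor returns the matcher names consulted on each branch of the chart.
// A nil result means the job is not classified at all.
func matchersFor(e podEvent) []string {
	containerMatchers := []string{"onConditions", "onExitCodes", "onTerminationMessage"}
	switch {
	case e.Failed:
		// PodFailed path: classification sees pod state only.
		return containerMatchers
	case e.Retryable:
		// Retryable issue: the lease is returned without classification.
		return nil
	default:
		// Non-retryable issue: pod state plus the issue message,
		// so onPodError can also be evaluated.
		return append(containerMatchers, "onPodError")
	}
}

func main() {
	fmt.Println(matchersFor(podEvent{Failed: true}))
	fmt.Println(matchersFor(podEvent{Retryable: true}))
	fmt.Println(matchersFor(podEvent{IssueMessage: "image pull failed"}))
}
```

The point of the sketch is that onPodError only has input on the non-retryable issue path, because that is the only branch where a pod-level message is passed along.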
2d58d38 to e36ca03
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
e36ca03 to d1690ff
Summary
The failure categorizer's existing matchers (onConditions, onExitCodes, onTerminationMessage) operate on per-container state in pod.Status. They don't see failures that produce no useful container terminationMessage: pre-startup kubelet/runtime errors (image pull, missing volume, missing ConfigMap/Secret) and Armada-detected pod-level failures (stuck terminating, active deadline exceeded, externally deleted). These end up with empty failure_category and failure_subcategory in lookoutdb today.
This PR adds onPodError, a new rule matcher dedicated to pod-level error text, so operators can write rules that categorize these failures.
A separate PR (#4890) adds curated diagnostic hints to user-facing failure messages. Each PR is independently shippable; together they deliver the full feature.
Approach
OnPodError on CategoryRule: matches a regex against the issue's pod-level error message. ContainerName scoping is ignored (pod-level text has no container attribution).
onTerminationMessage is unchanged: it still matches the container's Terminated.Message and still honors ContainerName. Its data source does not overlap with OnPodError's, by design.
Classify(pod, podErrorMessage string): the second argument carries the failure message the executor captured. This is needed because kubelet rotates Waiting.Reason from ErrImagePull to ImagePullBackOff within seconds, replacing Waiting.Message with a generic backoff string, so by the time Armada classifies the pod the runtime error is no longer in pod.Status.
Validation
To reproduce on local dev:
1. Add an onPodError rule to your executor config (_local/executor/config.yaml, under application:). Categorization is opt-in: Armada ships no default rules.
2. Submit a wrong-arch job (example/platform-mismatch.yaml):
armadactl create queue test
armadactl submit example/platform-mismatch.yaml
3. Wait for the kubelet event-based fail check to fire (typically 1-5 minutes).
4. Verify the categorization landed:
Expected:
Live-validated end-to-end on macOS arm64 (M3) against a k3d cluster.