
Add onPodError matcher to categorize pre-startup pod failures#4891

Open
dejanzele wants to merge 1 commit into armadaproject:master from dejanzele:categorizer-on-pod-error

Conversation


@dejanzele dejanzele commented Apr 30, 2026

Summary

The failure categorizer's existing matchers (onConditions, onExitCodes, onTerminationMessage) operate on per-container state in pod.Status. They don't see failures that produce no useful container terminationMessage: pre-startup kubelet/runtime errors (image pull, missing volume, missing ConfigMap/Secret) and Armada-detected pod-level failures (stuck terminating, active deadline exceeded, externally deleted). These end up with empty failure_category and failure_subcategory in lookoutdb today.

This PR adds onPodError, a new rule matcher dedicated to pod-level error text, so operators can write rules like:

- onPodError:
    pattern: "no match for platform in manifest"
  subcategory: "platform_mismatch"

A separate PR (#4890) adds curated diagnostic hints to user-facing failure messages. Each PR is independently shippable; together they deliver the full feature.

Approach

  • New rule field OnPodError on CategoryRule. Matches a regex against the issue's pod-level error message. ContainerName scoping is ignored (pod-level text has no container attribution).
  • onTerminationMessage is unchanged: it still matches container Terminated.Message and still honors ContainerName. By design its data source does not overlap with onPodError's.
  • Classify(pod, podErrorMessage string) — second arg carries the failure message the executor captured. Needed because kubelet rotates Waiting.Reason from ErrImagePull to ImagePullBackOff within seconds, replacing Waiting.Message with a generic backoff string, so by the time Armada classifies the pod the runtime error is no longer in pod.Status.
  • Config validation: a rule must specify exactly one of the four matchers; regex matchers compile-check at startup so invalid patterns fail fast.
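The matching and fail-fast validation described above can be sketched as follows. This is an illustrative mirror of the behavior, not Armada's actual API: categoryRule, compileRule, and matchPodError are hypothetical names, and the real CategoryRule carries the other three matchers as well.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical stand-in for the rule shape described in the PR.
// ContainerName scoping is omitted because pod-level text has no
// container attribution.
type categoryRule struct {
	OnPodError  *regexp.Regexp // matches pod-level error text
	Subcategory string
}

// compileRule demonstrates the fail-fast startup validation: an invalid
// pattern is rejected when the classifier is built, not at match time.
func compileRule(pattern, subcategory string) (categoryRule, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return categoryRule{}, fmt.Errorf("invalid onPodError pattern %q: %w", pattern, err)
	}
	return categoryRule{OnPodError: re, Subcategory: subcategory}, nil
}

// matchPodError mirrors the described guard: pod-level rules only fire
// when a non-empty executor-captured error message is present.
func matchPodError(rules []categoryRule, podErrorMessage string) (string, bool) {
	if podErrorMessage == "" {
		return "", false
	}
	for _, r := range rules {
		if r.OnPodError != nil && r.OnPodError.MatchString(podErrorMessage) {
			return r.Subcategory, true
		}
	}
	return "", false
}

func main() {
	rule, err := compileRule("no match for platform in manifest", "platform_mismatch")
	if err != nil {
		panic(err)
	}
	sub, ok := matchPodError(
		[]categoryRule{rule},
		"image error: no match for platform in manifest (wanted linux/arm64)",
	)
	fmt.Println(sub, ok) // platform_mismatch true
}
```

The empty-message guard is what keeps this matcher from firing on the terminal PodFailed path, where no executor-captured error string exists.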

Validation

To reproduce on local dev:

1. Add an onPodError rule to your executor config (_local/executor/config.yaml under application:):

application:
  errorCategories:
    defaultCategory: "uncategorized"
    defaultSubcategory: "unknown"
    categories:
      - name: infrastructure
        rules:
          - onPodError:
              pattern: "no match for platform in manifest"
            subcategory: "platform_mismatch"

Categorization is opt-in: Armada ships no default rules.

2. Submit a wrong-arch job (example/platform-mismatch.yaml):

queue: test
jobSetId: platform-mismatch-repro
jobs:
  - namespace: default
    priority: 0
    podSpec:
      terminationGracePeriodSeconds: 0
      restartPolicy: Never
      containers:
        - name: wrong-arch
          image: amd64/busybox:latest
          command:
            - sh
            - -c
            - echo should-never-run
          resources:
            requests:
              memory: 64Mi
              cpu: "0.1"
            limits:
              memory: 64Mi
              cpu: "0.1"
Then create the queue and submit the job:

armadactl create queue test
armadactl submit example/platform-mismatch.yaml

3. Wait for the kubelet event-based fail check to fire (typically 1-5 minutes).

4. Verify the categorization landed:

docker exec postgres psql -U postgres -d lookout -c \
  "SELECT job_id, run_id, finished, failure_category, failure_subcategory
   FROM job_run
   ORDER BY finished DESC NULLS LAST
   LIMIT 1;"

Expected:

 failure_category | failure_subcategory
------------------+---------------------
 infrastructure   | platform_mismatch

Live-validated end-to-end on macOS arm64 (M3) against a k3d cluster.


greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR adds onPodError, a new pod-level rule matcher for the failure categorizer, covering pre-startup failures (image pull, missing volume, active deadline exceeded, etc.) that produce no useful container terminationMessage. It splits the existing Classify method into ClassifyContainerError (terminal PodFailed path) and ClassifyPodError (issue-handler path, receives the executor-captured error string), and validates rules at NewClassifier time so invalid regexes fail fast at startup.
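The split into two public entry points backed by one private implementation can be sketched as below. The method names come from the review; the types and matcher logic are illustrative stand-ins, not Armada's real code.

```go
package main

import "fmt"

// Illustrative stand-ins for the real Armada types.
type pod struct{ phase string }
type classifyResult struct{ Category, Subcategory string }

type classifier struct{}

// classify is the shared private path: both public entry points funnel
// into one implementation. Real matcher evaluation (onConditions,
// onExitCodes, onTerminationMessage, onPodError) would happen here; this
// body is a placeholder.
func (c *classifier) classify(p pod, podErrorMessage string) classifyResult {
	if podErrorMessage != "" {
		return classifyResult{"infrastructure", "pod_error"}
	}
	return classifyResult{"uncategorized", "unknown"}
}

// ClassifyContainerError: terminal PodFailed path, pod state only, so no
// pod-level error message is available.
func (c *classifier) ClassifyContainerError(p pod) classifyResult {
	return c.classify(p, "")
}

// ClassifyPodError: issue-handler path, receives the executor-captured
// error string for pod-level rule matching.
func (c *classifier) ClassifyPodError(p pod, podErrorMessage string) classifyResult {
	return c.classify(p, podErrorMessage)
}

func main() {
	c := &classifier{}
	fmt.Println(c.ClassifyPodError(pod{"Pending"}, "no match for platform in manifest").Subcategory)
}
```

Keeping one private classify means the four matchers are evaluated identically on both paths; only the availability of the error string differs.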

Confidence Score: 5/5

Safe to merge; only P2 style findings, no logic or correctness issues.

All changes are additive and opt-in (no default rules shipped). The rename from Classify to ClassifyContainerError/ClassifyPodError is mechanical and fully covered by call-site updates. The only findings are P2: a test coverage gap and a minor doc comment observation.

internal/executor/categorizer/classifier_test.go — test dispatch heuristic leaves ClassifyPodError(pod, "") untested.

Important Files Changed

  • internal/executor/categorizer/classifier.go: Adds the onPodError regex matcher to rules; splits Classify into ClassifyContainerError / ClassifyPodError backed by a private classify(pod, podErrorMessage); updates ruleMatches to pass the message and short-circuit correctly behind the podErrorMessage != "" guard. Logic is clean and correct.
  • internal/executor/categorizer/classifier_test.go: Adds four new test cases for onPodError; the dispatch heuristic (if podErrorMessage == "") never calls ClassifyPodError with an empty string, leaving that code path untested.
  • internal/executor/categorizer/types.go: Adds the OnPodError *errormatch.RegexMatcher field to CategoryRule with clear doc comments; no issues.
  • internal/executor/categorizer/doc.go: Updates package-level documentation for the new matcher and the renamed API; incidentally removes a pre-existing documentation reference to an AppError condition constant that never actually existed.
  • internal/executor/service/pod_issue_handler.go: Switches handleNonRetryableJobIssue to ClassifyPodError(podIssue.OriginalPodState, podIssue.Message), correctly forwarding the executor-captured error message for pod-level rule matching.
  • internal/executor/service/pod_issue_handler_test.go: Adds TestPodIssueService_OnPodErrorClassifies with two end-to-end cases (platform mismatch via kubelet error, deadline exceeded); the podErrorClassifier helper keeps setup clean.
  • internal/executor/service/job_state_reporter.go: Correctly renames Classify to ClassifyContainerError for the PodFailed terminal path, which has no access to a pod error message.
  • internal/executor/reporter/event_test.go: Updates the call site from classifier.Classify to classifier.ClassifyContainerError; purely mechanical rename, no logic change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pod event detected] --> B{Pod Phase}
    B -- PodFailed --> C[job_state_reporter.go\nreportCurrentStatus]
    C --> D[ClassifyContainerError\npod state only]
    D --> E[Match: onConditions\nonExitCodes\nonTerminationMessage]

    B -- Pending / Stuck / Deadline --> F[pod_issue_handler.go\ndetectPodIssues]
    F --> G{Retryable?}
    G -- Yes --> H[handleRetryableJobIssue\nReturn lease, no classification]
    G -- No --> I[handleNonRetryableJobIssue]
    I --> J[ClassifyPodError\npod state + podIssue.Message]
    J --> K[Match: onConditions\nonExitCodes\nonTerminationMessage\nonPodError NEW]
    K --> L[ClassifyResult\nCategory + Subcategory]
    E --> L
    L --> M[CreateJobFailedEvent\nfailure_category / failure_subcategory]

Reviews (6): Last reviewed commit: "Add onPodError matcher to categorize pre..."

@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch 5 times, most recently from 2d58d38 to e36ca03 Compare April 30, 2026 13:03
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch from e36ca03 to d1690ff Compare April 30, 2026 13:10
@dejanzele (Member, Author) commented:

@greptileai
