
Add onPodError matcher to categorize pre-startup pod failures#4891

Open
dejanzele wants to merge 1 commit into armadaproject:master from dejanzele:categorizer-on-pod-error

Conversation


@dejanzele dejanzele commented Apr 30, 2026

Summary

The failure categorizer's existing matchers (onConditions, onExitCodes, onTerminationMessage) operate on per-container state in pod.Status. They don't see failures that produce no useful container terminationMessage: pre-startup kubelet/runtime errors (image pull, missing volume, missing ConfigMap/Secret) and Armada-detected pod-level failures (stuck terminating, active deadline exceeded, externally deleted). These end up with empty failure_category and failure_subcategory in lookoutdb today.

This PR adds onPodError, a new rule matcher dedicated to pod-level error text, so operators can write rules like:

- onPodError:
    pattern: "no match for platform in manifest"
  subcategory: "platform_mismatch"

A separate PR (#4890) adds curated diagnostic hints to user-facing failure messages. Each PR is independently shippable; together they deliver the full feature.

Approach

  • New rule field OnPodError on CategoryRule. Matches a regex against the issue's pod-level error message. ContainerName scoping is ignored (pod-level text has no container attribution).
  • onTerminationMessage is unchanged: it still matches container Terminated.Message and still honors ContainerName. By design its data source does not overlap with onPodError's.
  • Classify(pod, podErrorMessage string) — second arg carries the failure message the executor captured. Needed because kubelet rotates Waiting.Reason from ErrImagePull to ImagePullBackOff within seconds, replacing Waiting.Message with a generic backoff string, so by the time Armada classifies the pod the runtime error is no longer in pod.Status.
  • Config validation: a rule must specify exactly one of the four matchers; regex matchers compile-check at startup so invalid patterns fail fast.
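The matching and fail-fast validation described above can be sketched as follows. This is an illustrative mirror of the behavior, not Armada's actual API: categoryRule, compileRule, and matchPodError are hypothetical names, and the real CategoryRule carries the other three matchers as well.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical stand-in for the rule shape described in the PR.
// ContainerName scoping is omitted because pod-level text has no
// container attribution.
type categoryRule struct {
	OnPodError  *regexp.Regexp // matches pod-level error text
	Subcategory string
}

// compileRule demonstrates the fail-fast startup validation: an invalid
// pattern is rejected when the classifier is built, not at match time.
func compileRule(pattern, subcategory string) (categoryRule, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return categoryRule{}, fmt.Errorf("invalid onPodError pattern %q: %w", pattern, err)
	}
	return categoryRule{OnPodError: re, Subcategory: subcategory}, nil
}

// matchPodError mirrors the described guard: pod-level rules only fire
// when a non-empty executor-captured error message is present.
func matchPodError(rules []categoryRule, podErrorMessage string) (string, bool) {
	if podErrorMessage == "" {
		return "", false
	}
	for _, r := range rules {
		if r.OnPodError != nil && r.OnPodError.MatchString(podErrorMessage) {
			return r.Subcategory, true
		}
	}
	return "", false
}

func main() {
	rule, err := compileRule("no match for platform in manifest", "platform_mismatch")
	if err != nil {
		panic(err)
	}
	sub, ok := matchPodError(
		[]categoryRule{rule},
		"image error: no match for platform in manifest (wanted linux/arm64)",
	)
	fmt.Println(sub, ok) // platform_mismatch true
}
```

The empty-message guard is what keeps this matcher from firing on the terminal PodFailed path, where no executor-captured error string exists.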

Validation

To reproduce on local dev:

1. Add an onPodError rule to your executor config (_local/executor/config.yaml under application:):

application:
  errorCategories:
    defaultCategory: "uncategorized"
    defaultSubcategory: "unknown"
    categories:
      - name: infrastructure
        rules:
          - onPodError:
              pattern: "no match for platform in manifest"
            subcategory: "platform_mismatch"

Categorization is opt-in: Armada ships no default rules.

2. Submit a wrong-arch job (example/platform-mismatch.yaml):

queue: test
jobSetId: platform-mismatch-repro
jobs:
  - namespace: default
    priority: 0
    podSpec:
      terminationGracePeriodSeconds: 0
      restartPolicy: Never
      containers:
        - name: wrong-arch
          image: amd64/busybox:latest
          command:
            - sh
            - -c
            - echo should-never-run
          resources:
            requests:
              memory: 64Mi
              cpu: "0.1"
            limits:
              memory: 64Mi
              cpu: "0.1"
Then create the queue and submit the job:

armadactl create queue test
armadactl submit example/platform-mismatch.yaml

3. Wait for the kubelet event-based fail check to fire (typically 1-5 minutes).

4. Verify the categorization landed:

docker exec postgres psql -U postgres -d lookout -c \
  "SELECT job_id, run_id, finished, failure_category, failure_subcategory
   FROM job_run
   ORDER BY finished DESC NULLS LAST
   LIMIT 1;"

Expected:

 failure_category | failure_subcategory
------------------+---------------------
 infrastructure   | platform_mismatch

Live-validated end-to-end on macOS arm64 (M3) against a k3d cluster.


greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR adds onPodError, a new pod-level rule matcher for the failure categorizer, covering pre-startup failures (image pull, missing volume, active deadline exceeded, etc.) that produce no useful container terminationMessage. It splits the existing Classify method into ClassifyContainerError (terminal PodFailed path) and ClassifyPodError (issue-handler path, receives the executor-captured error string), and validates rules at NewClassifier time so invalid regexes fail fast at startup.
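The split into two public entry points backed by one private implementation can be sketched as below. The method names come from the review; the types and matcher logic are illustrative stand-ins, not Armada's real code.

```go
package main

import "fmt"

// Illustrative stand-ins for the real Armada types.
type pod struct{ phase string }
type classifyResult struct{ Category, Subcategory string }

type classifier struct{}

// classify is the shared private path: both public entry points funnel
// into one implementation. Real matcher evaluation (onConditions,
// onExitCodes, onTerminationMessage, onPodError) would happen here; this
// body is a placeholder.
func (c *classifier) classify(p pod, podErrorMessage string) classifyResult {
	if podErrorMessage != "" {
		return classifyResult{"infrastructure", "pod_error"}
	}
	return classifyResult{"uncategorized", "unknown"}
}

// ClassifyContainerError: terminal PodFailed path, pod state only, so no
// pod-level error message is available.
func (c *classifier) ClassifyContainerError(p pod) classifyResult {
	return c.classify(p, "")
}

// ClassifyPodError: issue-handler path, receives the executor-captured
// error string for pod-level rule matching.
func (c *classifier) ClassifyPodError(p pod, podErrorMessage string) classifyResult {
	return c.classify(p, podErrorMessage)
}

func main() {
	c := &classifier{}
	fmt.Println(c.ClassifyPodError(pod{"Pending"}, "no match for platform in manifest").Subcategory)
}
```

Keeping one private classify means the four matchers are evaluated identically on both paths; only the availability of the error string differs.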

Confidence Score: 5/5

Safe to merge; only P2 style findings, no logic or correctness issues.

All changes are additive and opt-in (no default rules shipped). The rename from Classify to ClassifyContainerError/ClassifyPodError is mechanical and fully covered by call-site updates. The only findings are P2: a test coverage gap and a minor doc comment observation.

internal/executor/categorizer/classifier_test.go — test dispatch heuristic leaves ClassifyPodError(pod, "") untested.

Important Files Changed

  • internal/executor/categorizer/classifier.go: Adds the onPodError regex matcher to rules; splits Classify into ClassifyContainerError / ClassifyPodError backed by a private classify(pod, podErrorMessage); updates ruleMatches to pass the message and short-circuit correctly behind the podErrorMessage != "" guard. Logic is clean and correct.
  • internal/executor/categorizer/classifier_test.go: Adds four new test cases for onPodError; the dispatch heuristic (if podErrorMessage == "") never calls ClassifyPodError with an empty string, leaving that code path untested.
  • internal/executor/categorizer/types.go: Adds the OnPodError *errormatch.RegexMatcher field to CategoryRule with clear doc comments; no issues.
  • internal/executor/categorizer/doc.go: Updates package-level documentation for the new matcher and the renamed API; incidentally removes a pre-existing documentation reference to an AppError condition constant that never actually existed.
  • internal/executor/service/pod_issue_handler.go: Switches handleNonRetryableJobIssue to ClassifyPodError(podIssue.OriginalPodState, podIssue.Message), correctly forwarding the executor-captured error message for pod-level rule matching.
  • internal/executor/service/pod_issue_handler_test.go: Adds TestPodIssueService_OnPodErrorClassifies with two end-to-end cases (platform mismatch via kubelet error, deadline exceeded); the podErrorClassifier helper keeps setup clean.
  • internal/executor/service/job_state_reporter.go: Correctly renames Classify to ClassifyContainerError for the PodFailed terminal path, which has no access to a pod error message.
  • internal/executor/reporter/event_test.go: Updates the call site from classifier.Classify to classifier.ClassifyContainerError; purely mechanical rename, no logic change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pod event detected] --> B{Pod Phase}
    B -- PodFailed --> C[job_state_reporter.go\nreportCurrentStatus]
    C --> D[ClassifyContainerError\npod state only]
    D --> E[Match: onConditions\nonExitCodes\nonTerminationMessage]

    B -- Pending / Stuck / Deadline --> F[pod_issue_handler.go\ndetectPodIssues]
    F --> G{Retryable?}
    G -- Yes --> H[handleRetryableJobIssue\nReturn lease, no classification]
    G -- No --> I[handleNonRetryableJobIssue]
    I --> J[ClassifyPodError\npod state + podIssue.Message]
    J --> K[Match: onConditions\nonExitCodes\nonTerminationMessage\nonPodError NEW]
    K --> L[ClassifyResult\nCategory + Subcategory]
    E --> L
    L --> M[CreateJobFailedEvent\nfailure_category / failure_subcategory]

Reviews (6): Last reviewed commit: "Add onPodError matcher to categorize pre..."

@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch 5 times, most recently from 2d58d38 to e36ca03 Compare April 30, 2026 13:03
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the categorizer-on-pod-error branch from e36ca03 to d1690ff Compare April 30, 2026 13:10
@dejanzele (Member, Author) commented:

@greptileai
