Skip to content

Add diagnostic hint catalog for well-known executor failure modes#4890

Open
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:executor-diagnostic-hints
Open

Add diagnostic hint catalog for well-known executor failure modes#4890
dejanzele wants to merge 1 commit intoarmadaproject:masterfrom
dejanzele:executor-diagnostic-hints

Conversation

@dejanzele
Copy link
Copy Markdown
Member

@dejanzele dejanzele commented Apr 30, 2026

Summary

When a job fails for a well-known reason (e.g. wrong-arch image), the user-facing error in Lookout is the raw kubelet text, often opaque. This PR adds a diagnostics package that maps such errors to actionable hints; the hint is prepended to the failure message at the three executor sites that build user-facing text (reporter/event.go PodFailed branch, service/pod_issue_handler.go retryable failed-pod path, and service/pod_issue_handler.go createStuckPodMessage).

The package is separate from the categorizer because the two are orthogonal: diagnostic hints are always-on and curated, categorization is opt-in and operator-configured. Splitting them avoids forcing a categorizer dependency on hint consumers and lets either side evolve without churning the other.

Adding a hint is a single {pattern, text} entry in builtinHints. Patterns compile at startup; first match wins. A future PR may move the catalog from compiled-in to config-driven once we see whether operators want deployment-specific entries.

Validation

To reproduce on local dev:

1. Submit a wrong-arch job (example/platform-mismatch.yaml):

queue: test
jobSetId: platform-mismatch-repro
jobs:
  - namespace: default
    priority: 0
    podSpec:
      terminationGracePeriodSeconds: 0
      restartPolicy: Never
      containers:
        - name: wrong-arch
          # On an arm64 cluster, force-pull the amd64 variant.
          # Flip to arm64v8/busybox if reproducing on an amd64 cluster.
          image: amd64/busybox:latest
          command:
            - sh
            - -c
            - echo should-never-run
          resources:
            requests:
              memory: 64Mi
              cpu: "0.1"
            limits:
              memory: 64Mi
              cpu: "0.1"
armadactl create queue test   # if not already present
armadactl submit example/platform-mismatch.yaml

2. Wait for the kubelet event-based fail check to fire (typically 1-5 minutes depending on event_checks grace period).

3. Verify the hint appears in the lookout DB. The error column is zlib-compressed bytea, so decode in two steps:

# Failure metadata
docker exec postgres psql -U postgres -d lookout -c \
  "SELECT job_id, run_id, finished, exit_code
   FROM job_run
   ORDER BY finished DESC NULLS LAST
   LIMIT 1;"

# Decompressed error message (the byte string the Lookout UI shows users)
docker exec postgres psql -U postgres -At -d lookout -c \
  "SELECT encode(error, 'base64') FROM job_run
   ORDER BY finished DESC NULLS LAST LIMIT 1;" \
| python3 -c 'import sys,base64,zlib; print(zlib.decompress(base64.b64decode(sys.stdin.read())).decode())'

Expected (and observed) result

Decompressed error column begins with the curated hint, followed by the raw kubelet error preserved verbatim:

No compatible image was found for the target farm node's architecture; this is often due to an x64/arm64 mismatch, so check the image architecture or build and publish a compatible version.
Container ... has been in state Waiting for reason ImagePullBackOff (Failed to pull image "amd64/busybox:latest": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/amd64/busybox:latest": no match for platform in manifest: not found) for more than timeout 1ns

Live-validated end-to-end on macOS arm64 (M3) against a k3d cluster.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR introduces internal/executor/diagnostics, a curated hint catalog that prepends user-facing guidance to opaque kubelet error messages. The hint is injected at three independent failure-message construction sites — direct pod failure reporting (event.go), retryable failed-pod registration, and stuck-pending pod detection — each wiring in diagnostics.LookupHint before storing or emitting the message.

The implementation is correct: job_state_reporter.go ensures these three paths are mutually exclusive for any given pod event, so the hint is applied exactly once regardless of which path handles the failure. The createStuckPodMessage refactor correctly preserves the retryability preface (\"Unable to start pod.\" vs \"Unable to start pod - encountered an unrecoverable problem.\") when a hint fires, addressing what had been flagged in a previous review.

Confidence Score: 5/5

Safe to merge — hint injection is additive, paths are mutually exclusive, and the retryability preface is correctly preserved.

No P0 or P1 findings. All three injection sites are correct and non-overlapping. The previously flagged retryable-preface issue is resolved in this version. Test coverage spans all four retryable/non-retryable x hint/no-hint combinations.

No files require special attention.

Important Files Changed

Filename Overview
internal/executor/diagnostics/diagnostics.go New package providing LookupHint; clean regex catalog with one platform-mismatch entry, thread-safe, compiled at init time.
internal/executor/diagnostics/diagnostics_test.go Unit tests cover empty message, unknown failure, and platform-mismatch match; straightforward and complete for the current catalog size.
internal/executor/diagnostics/doc.go Package doc only; no logic changes.
internal/executor/reporter/event.go Adds hint injection in the PodFailed branch before calling CreateJobFailedEvent; hint is applied to ExtractPodFailedReason output cleanly.
internal/executor/service/pod_issue_handler.go Hint injected at two sites: retryable failed-pod registration and createStuckPodMessage; retryability preface correctly preserved when hint fires.
internal/executor/service/pod_issue_handler_test.go New TestCreateStuckPodMessage covers all four retryable/non-retryable x hint/no-hint combinations; verifies prefix, hint presence, and original message inclusion.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pod State Change Detected] --> B{Pod Phase?}

    B -->|PodFailed| C[job_state_reporter.go\nCreateEventForCurrentState]
    C --> D[ExtractPodFailedReason]
    D --> E[diagnostics.LookupHint]
    E -->|hint found| F[reason = hint + reason]
    E -->|no hint| G[reason unchanged]
    F --> H[CreateJobFailedEvent]
    G --> H
    H --> I{DetectAndRegisterFailedPodIssue?}
    I -->|issueAdded=true retryable| J[Store podIssue.Message\nhint + original\nDiscard event]
    I -->|issueAdded=false non-retryable| K[Queue event with hint]
    J --> L[handleRetryableJobIssue\nReturn lease / retry]

    B -->|PodPending/Unknown| M[detectPodIssues\npendingPodChecker.GetAction]
    M -->|action != Wait| N[createStuckPodMessage\nretryable bool + originalMessage]
    N --> O[diagnostics.LookupHint]
    O -->|hint found| P[preface + hint + original]
    O -->|no hint| Q[preface + original]
    P --> R[Store podIssue.Message]
    Q --> R
    R --> S[handleNonRetryableJobIssue\nCreateJobFailedEvent with stored message]
Loading

Reviews (4): Last reviewed commit: "Add diagnostic hint catalog for well-kno..." | Re-trigger Greptile

@dejanzele dejanzele force-pushed the executor-diagnostic-hints branch 2 times, most recently from 3a190a2 to 5190047 Compare April 30, 2026 11:48
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the executor-diagnostic-hints branch from 5190047 to 96631a6 Compare April 30, 2026 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants