Replace FailureInfo with flat failure_category/failure_subcategory fields#4863

Merged
dejanzele merged 5 commits into armadaproject:master from dejanzele:slim-failure-classification-commits
Apr 23, 2026

Conversation

Member

@dejanzele dejanzele commented Apr 22, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it

Reworks how pod failures get categorized in the executor and changes the wire format that carries the result downstream to Lookout.

Classifier: first-match-wins with subcategory

The executor classifier used to return a []string of every category whose rules matched a failed pod. That's hostile to metrics: a single failure counted as two or three categories double-counts in "failures by category" rollups, and there's no stable top-level bucket to alert on.

The classifier now returns a single (category, subcategory). Rules inside a category are evaluated in config order; the first rule that matches wins and contributes its optional subcategory. Categories are likewise evaluated in config order. If nothing matches, the classifier returns a configurable defaultCategory, falling back to the built-in "uncategorized" if one isn't set.

The top-level category stays low-cardinality for SLOs and alerts. Subcategory carries the drill-down detail without inflating the bucket count.
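
The evaluation order described above can be sketched roughly as follows. This is a minimal illustration, not the executor's code: the `rule`, `category`, and `classifier` types are simplified stand-ins, rules match on an exit code only (the real rules presumably inspect more of the pod), and the category names are made up.

```go
package main

import "fmt"

// rule is a hypothetical, simplified rule: it matches on exit code only.
type rule struct {
	exitCode    int32
	subcategory string // optional; empty means "no subcategory"
}

// category groups rules under one low-cardinality name.
type category struct {
	name  string
	rules []rule
}

type classifier struct {
	defaultCategory    string
	defaultSubcategory string
	categories         []category
}

// builtinDefault mirrors the "uncategorized" fallback named in the description.
const builtinDefault = "uncategorized"

// classify walks categories, then rules, in config order and returns the
// first match. If nothing matches it returns the configured default, or the
// built-in default when none is configured.
func (c classifier) classify(exitCode int32) (string, string) {
	for _, cat := range c.categories {
		for _, r := range cat.rules {
			if r.exitCode == exitCode {
				return cat.name, r.subcategory
			}
		}
	}
	if c.defaultCategory != "" {
		return c.defaultCategory, c.defaultSubcategory
	}
	return builtinDefault, c.defaultSubcategory
}

// exampleClassifier builds a config in which exit code 137 matches rules in
// two categories, so the order of categories decides the result.
func exampleClassifier() classifier {
	return classifier{categories: []category{
		{name: "infrastructure", rules: []rule{{exitCode: 137, subcategory: "killed"}}},
		{name: "user-error", rules: []rule{{exitCode: 137}, {exitCode: 1, subcategory: "nonzero-exit"}}},
	}}
}

func main() {
	c := exampleClassifier()
	for _, code := range []int32{137, 1, 42} {
		cat, sub := c.classify(code)
		fmt.Printf("exit %d: category=%q subcategory=%q\n", code, cat, sub)
	}
}
```

Because the function returns as soon as one rule matches, a failure lands in exactly one category, which is the property the metrics argument relies on.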

Feature flag

Categorization is gated on a new executor config field enableJobErrorCategorization, default false. When off, the classifier isn't constructed and no category is emitted, so existing deployments are unchanged until they opt in.
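
As an illustration of the gating, here is a small sketch under stated assumptions: `executorConfig`, `newClassifierIfEnabled`, and `failureFields` are hypothetical names, and only the `EnableJobErrorCategorization` field comes from this PR. The point it shows is that a zero-value config yields no classifier and therefore empty category fields.

```go
package main

import "fmt"

// executorConfig is a hypothetical, trimmed-down view of the executor config.
type executorConfig struct {
	EnableJobErrorCategorization bool // zero value is false, i.e. off by default
}

// classifier is a placeholder for the real classifier type.
type classifier struct{}

func (c *classifier) classify() (string, string) { return "uncategorized", "" }

// newClassifierIfEnabled returns nil when the flag is off, so deployments
// that have not opted in never construct a classifier.
func newClassifierIfEnabled(cfg executorConfig) *classifier {
	if !cfg.EnableJobErrorCategorization {
		return nil
	}
	return &classifier{}
}

// failureFields returns empty strings when no classifier is configured,
// which leaves both event fields unset.
func failureFields(c *classifier) (category, subcategory string) {
	if c == nil {
		return "", ""
	}
	return c.classify()
}

func main() {
	var off executorConfig
	on := executorConfig{EnableJobErrorCategorization: true}

	cat, sub := failureFields(newClassifierIfEnabled(off))
	fmt.Printf("flag off: %q %q\n", cat, sub)

	cat, sub = failureFields(newClassifierIfEnabled(on))
	fmt.Printf("flag on:  %q %q\n", cat, sub)
}
```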

Wire format: flat scalars instead of a nested message

armadaevents.Error used to carry a FailureInfo submessage with exit_code, termination_message, categories []string, and container_name. Three of those four fields duplicate data that's already on ContainerError, and the []string shape matched the old classifier.

The FailureInfo message is removed, field 15 is reserved, and two new scalar fields are added to Error:

  • failure_category (16)
  • failure_subcategory (17)

The same flattening is applied to the public api.JobFailedEvent: its categories []string is replaced by the two scalars at the same tag numbers.

Lookout storage

Migration 032 adds two varchar(63) columns to job_run: failure_category and failure_subcategory. The Lookout ingester writes these columns directly instead of packing a jsonb failure_info blob. Plain columns are cheaper to update than jsonb and support ordinary indexes if we want to query on them later.

The old failure_info jsonb column is left in place for now so the read path keeps working while it's migrated.

Follow-up

  • Drop the failure_info jsonb column once the Lookout read path and UI move to the new columns.
  • Add an executor counter job_failure_category_total labelled by category so the new classification is observable.
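
To make the second follow-up concrete, here is a stdlib-only sketch of a counter labelled by category. It is not the planned implementation: a real executor metric would more likely use a metrics client library, and `categoryCounter` and its methods are hypothetical names. It shows why one category per failure keeps the per-category counts summing to the number of failures.

```go
package main

import (
	"fmt"
	"sync"
)

// categoryCounter is a hypothetical stand-in for a labelled counter such as
// job_failure_category_total, keyed by the top-level category only.
type categoryCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newCategoryCounter() *categoryCounter {
	return &categoryCounter{counts: make(map[string]uint64)}
}

// inc records one failure under exactly one category label.
func (c *categoryCounter) inc(category string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[category]++
}

// get returns the count for a single category.
func (c *categoryCounter) get(category string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[category]
}

// total sums all categories; with one category per failure this equals the
// number of failures recorded.
func (c *categoryCounter) total() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	var sum uint64
	for _, n := range c.counts {
		sum += n
	}
	return sum
}

func main() {
	counter := newCategoryCounter()
	for _, cat := range []string{"user-error", "infrastructure", "user-error"} {
		counter.inc(cat)
	}
	fmt.Println("user-error:", counter.get("user-error"))
	fmt.Println("total:", counter.total())
}
```

Leaving the subcategory out of the label set keeps the number of time series bounded by the number of configured categories.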

Which issue(s) this PR fixes

Special notes for your reviewer

Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR replaces the FailureInfo proto sub-message with flat failure_category / failure_subcategory scalar fields on both armadaevents.Error and api.JobFailedEvent, refactors the executor classifier to a first-match-wins model with an optional per-rule subcategory, and lands a DB migration adding two varchar(63) columns to job_run. The enableJobRunFailureInfoMap feature flag is removed and the Lookout ingester now writes the new columns directly; the old failure_info JSONB column stays in the schema for backward compatibility.

  • The new varchar(63) columns have no corresponding length validation in NewClassifier — category names or subcategory strings exceeding 63 characters will cause Postgres errors at ingestion time and stall the ingester batch.

Confidence Score: 4/5

Safe to merge after addressing the missing length validation against the varchar(63) DB constraint

One P1 issue: NewClassifier accepts arbitrarily long category and subcategory strings that the DB will reject at ingestion time. All other changes are clean and well-tested. The wire-format migration is backward-compatible (field 15 reserved, new fields at 16/17) and the feature is gated behind EnableJobErrorCategorization.

internal/executor/categorizer/classifier.go — needs length validation for category name and subcategory fields

Important Files Changed

| Filename | Overview |
| --- | --- |
| internal/executor/categorizer/classifier.go | Classifier refactored to first-match-wins with subcategory; missing length validation against varchar(63) DB constraint for category/subcategory names |
| internal/executor/categorizer/types.go | New ErrorCategoriesConfig wrapper type added with DefaultCategory, DefaultSubcategory, and Categories fields; clean addition |
| pkg/armadaevents/events.proto | Removes FailureInfo message and field 15, reserves field 15, and adds failure_category (16) and failure_subcategory (17) scalars to Error; wire-format change is backward-compatible |
| pkg/api/event.proto | Replaces repeated string categories (field 15, now reserved) with failure_category (16) and failure_subcategory (17) on JobFailedEvent |
| internal/lookout/schema/migrations/032_add_failure_category_to_job_run.sql | Adds failure_category varchar(63) and failure_subcategory varchar(63) to job_run; old failure_info column preserved for backward compatibility |
| internal/lookoutingester/instructions/instructions.go | Removes FailureInfo JSONB map logic and enableJobRunFailureInfoMap feature flag; writes failure_category/failure_subcategory scalars for terminal errors only |
| internal/lookoutingester/lookoutdb/insertion.go | Updates batch and scalar UPDATE paths to use new failure_category/failure_subcategory columns; parameter numbering is correct |
| internal/executor/reporter/event.go | Replaces FailureInfo proto with flat failure_category/failure_subcategory params; CreateSimpleJobFailedEvent signature simplified for preemption/submit-failure paths |
| internal/executor/service/pod_issue_handler.go | handleNonRetryableJobIssue now calls CreateJobFailedEvent directly with actual container statuses instead of CreateSimpleJobFailedEvent with an empty slice, an incidental improvement |
| internal/executor/configuration/types.go | Adds EnableJobErrorCategorization feature flag; ErrorCategories type changed from []CategoryConfig to ErrorCategoriesConfig |
| internal/server/event/conversion/conversions.go | Converts GetFailureInfo().GetCategories() to the new GetFailureCategory()/GetFailureSubcategory() accessors on api.JobFailedEvent |
| internal/server/queryapi/database/query.sql.go | Both GetJobRunsByJobIds and GetJobRunsByRunIds SELECTs updated to include the two new columns alongside the preserved failure_info column |

Sequence Diagram

sequenceDiagram
    participant K8s as Kubernetes Pod
    participant Exec as Executor
    participant Cls as Classifier
    participant Rep as Event Reporter
    participant Puls as Pulsar
    participant Ing as Lookout Ingester
    participant DB as PostgreSQL (job_run)

    K8s->>Exec: Pod phase = Failed
    Exec->>Cls: Classify(pod)
    Note over Cls: First-match-wins across categories<br/>Returns (category, subcategory)
    Cls-->>Exec: ClassifyResult{Category, Subcategory}
    Exec->>Rep: CreateJobFailedEvent(..., category, subcategory)
    Rep->>Puls: armadaevents.Error{failure_category, failure_subcategory}
    Puls->>Ing: EventSequence
    Ing->>Ing: handleJobRunErrors()<br/>terminal=true → set category fields
    Ing->>DB: UPDATE job_run SET failure_category=$11, failure_subcategory=$12
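
The ingester step in the diagram ("terminal=true → set category fields") might look something like the sketch below. Everything here is an assumption made for illustration: `runError`, `jobRunUpdate`, and `categoryFields` are hypothetical names, and both the choice of the first terminal error and the mapping of empty strings to nil (so the column would stay NULL) are guesses rather than behaviour confirmed by the PR.

```go
package main

import "fmt"

// runError is a hypothetical, flattened view of one error on a job run.
type runError struct {
	Terminal           bool
	FailureCategory    string
	FailureSubcategory string
}

// jobRunUpdate holds the two column values; nil means "leave as NULL".
type jobRunUpdate struct {
	FailureCategory    *string
	FailureSubcategory *string
}

// nonEmpty returns nil for an empty string and a pointer otherwise.
func nonEmpty(s string) *string {
	if s == "" {
		return nil
	}
	return &s
}

// categoryFields copies the category fields from the first terminal error
// and ignores non-terminal errors entirely.
func categoryFields(errs []runError) jobRunUpdate {
	for _, e := range errs {
		if e.Terminal {
			return jobRunUpdate{
				FailureCategory:    nonEmpty(e.FailureCategory),
				FailureSubcategory: nonEmpty(e.FailureSubcategory),
			}
		}
	}
	return jobRunUpdate{}
}

func main() {
	u := categoryFields([]runError{
		{Terminal: false, FailureCategory: "ignored"},
		{Terminal: true, FailureCategory: "user-error", FailureSubcategory: "nonzero-exit"},
	})
	fmt.Println(*u.FailureCategory, *u.FailureSubcategory)
}
```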

Reviews (5): Last reviewed commit: "Update test fixtures, validators, and re..."

Comment thread internal/lookout/schema/migrations/032_add_failure_category_to_job_run.sql Outdated
@dejanzele dejanzele changed the title Replace FailureInfo with flat failure_category/failure_subcategory fields (commit-by-commit) Replace FailureInfo with flat failure_category/failure_subcategory fields Apr 23, 2026
@dejanzele dejanzele force-pushed the slim-failure-classification-commits branch 3 times, most recently from c657a13 to 5d91ae9 Compare April 23, 2026 08:35
…ind feature flag

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…tegory scalars

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…r event conversion

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…ion 032)

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…re fields

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the slim-failure-classification-commits branch from 5d91ae9 to 4c7e640 Compare April 23, 2026 08:42
Comment on lines 44 to +70
@@ -45,16 +54,20 @@ func NewClassifier(configs []CategoryConfig) (*Classifier, error) {
 			return nil, fmt.Errorf("category %q must have at least one rule", cfg.Name)
 		}
 		cat := category{name: cfg.Name}
-		for i, rule := range cfg.Rules {
-			r, err := buildRule(rule)
+		for i, r := range cfg.Rules {
+			built, err := buildRule(r)
 			if err != nil {
 				return nil, fmt.Errorf("category %q rule %d: %w", cfg.Name, i, err)
 			}
-			cat.rules = append(cat.rules, r)
+			cat.rules = append(cat.rules, built)
 		}
 		categories = append(categories, cat)
 	}
-	return &Classifier{categories: categories}, nil
+	return &Classifier{
+		defaultCategory:    config.DefaultCategory,
+		defaultSubcategory: config.DefaultSubcategory,
+		categories:         categories,
+	}, nil
Contributor

P1 Missing length validation for varchar(63) DB constraint

NewClassifier validates empty names, duplicates, and malformed rules, but not the length of cfg.Name or rule Subcategory values. The migration and temp-table DDL both declare these columns as varchar(63). If an operator configures a category name or subcategory string longer than 63 characters, the Lookout ingester's batch UPDATE will fail with a PostgreSQL "value too long for type character varying(63)" error, stalling the ingestion pipeline for that batch.

Add a length guard during classifier construction:

		const maxLen = 63
		if len(cfg.Name) > maxLen {
			return nil, fmt.Errorf("category name %q exceeds maximum length %d", cfg.Name, maxLen)
		}
		...
		for i, r := range cfg.Rules {
			if len(r.Subcategory) > maxLen {
				return nil, fmt.Errorf("category %q rule %d: subcategory %q exceeds maximum length %d", cfg.Name, i, r.Subcategory, maxLen)
			}

The same check should cover config.DefaultCategory and config.DefaultSubcategory.

Contributor

TODO: ^ - we should fix it in follow up PR straight after this one.

-func CreateSimpleJobFailedEvent(pod *v1.Pod, reason string, debugMessage string, clusterId string, cause armadaevents.KubernetesReason, failureInfo *armadaevents.FailureInfo) (*armadaevents.EventSequence, error) {
-	return CreateJobFailedEvent(pod, reason, cause, debugMessage, []*armadaevents.ContainerError{}, clusterId, failureInfo)
+// CreateSimpleJobFailedEvent creates a failed event with no container details and no classification.
+// Use for failures where pod container statuses are unavailable (preemption, submit failures).
Member Author

TODO: think about rejected jobs

Contributor

@masipauskas masipauskas left a comment

LGTM, in follow up PR, we should:

  • Add category/subcategory length validations when setting up categoriser from config.
  • We should consider how, if at all, we handle rejected jobs (for now, I think, we just ignore them for categorisation purposes).

@dejanzele dejanzele merged commit 8494d0c into armadaproject:master Apr 23, 2026
32 checks passed
dejanzele added a commit that referenced this pull request Apr 23, 2026
#4867)

#### What type of PR is this?

bug-fix / follow-up

#### What this PR does / why we need it

Follow-up to #4863. PR #4863 added `failure_category varchar(63)` and
`failure_subcategory varchar(63)` columns to `job_run` (migration 032),
but `NewClassifier` accepted arbitrarily long values from config. An
operator setting a category name longer than 63 chars would only
discover the problem at the first failed pod, when the lookout
ingester's batch UPDATE hit `value too long for type character
varying(63)` and stalled the batch.

This PR adds length guards at config-load time so the executor refuses
to start with bad config rather than silently failing hours later. New
constant `maxCategoryNameLen = 63` documents the link to migration 032.
`NewClassifier` validates `DefaultCategory`, `DefaultSubcategory`, every
`cfg.Name`, and every rule's `Subcategory`.

#### Which issue(s) this PR fixes

Fixes #

#### Special notes for your reviewer

Addresses a review follow-up on #4863. A second follow-up from that
review (handling rejected jobs for categorization purposes) is
intentionally out of scope here - it's a design discussion rather than a
mechanical change and belongs in a separate issue/PR.

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>