Replace FailureInfo with flat failure_category/failure_subcategory fields#4863
Conversation
Greptile SummaryThis PR replaces the
Confidence Score: 4/5Safe to merge after addressing the missing length validation against the varchar(63) DB constraint One P1 issue: NewClassifier accepts arbitrarily long category and subcategory strings that the DB will reject at ingestion time. All other changes are clean and well-tested. The wire-format migration is backward-compatible (field 15 reserved, new fields at 16/17) and the feature is gated behind EnableJobErrorCategorization. internal/executor/categorizer/classifier.go — needs length validation for category name and subcategory fields Important Files Changed
Sequence DiagramsequenceDiagram
participant K8s as Kubernetes Pod
participant Exec as Executor
participant Cls as Classifier
participant Rep as Event Reporter
participant Puls as Pulsar
participant Ing as Lookout Ingester
participant DB as PostgreSQL (job_run)
K8s->>Exec: Pod phase = Failed
Exec->>Cls: Classify(pod)
Note over Cls: First-match-wins across categories<br/>Returns (category, subcategory)
Cls-->>Exec: ClassifyResult{Category, Subcategory}
Exec->>Rep: CreateJobFailedEvent(..., category, subcategory)
Rep->>Puls: armadaevents.Error{failure_category, failure_subcategory}
Puls->>Ing: EventSequence
Ing->>Ing: handleJobRunErrors()<br/>terminal=true → set category fields
Ing->>DB: UPDATE job_run SET failure_category=$11, failure_subcategory=$12
Reviews (5): Last reviewed commit: "Update test fixtures, validators, and re..." | Re-trigger Greptile |
c657a13 to
5d91ae9
Compare
…ind feature flag Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…tegory scalars Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…r event conversion Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…ion 032) Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
…re fields Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
5d91ae9 to
4c7e640
Compare
| @@ -45,16 +54,20 @@ func NewClassifier(configs []CategoryConfig) (*Classifier, error) { | |||
| return nil, fmt.Errorf("category %q must have at least one rule", cfg.Name) | |||
| } | |||
| cat := category{name: cfg.Name} | |||
| for i, rule := range cfg.Rules { | |||
| r, err := buildRule(rule) | |||
| for i, r := range cfg.Rules { | |||
| built, err := buildRule(r) | |||
| if err != nil { | |||
| return nil, fmt.Errorf("category %q rule %d: %w", cfg.Name, i, err) | |||
| } | |||
| cat.rules = append(cat.rules, r) | |||
| cat.rules = append(cat.rules, built) | |||
| } | |||
| categories = append(categories, cat) | |||
| } | |||
| return &Classifier{categories: categories}, nil | |||
| return &Classifier{ | |||
| defaultCategory: config.DefaultCategory, | |||
| defaultSubcategory: config.DefaultSubcategory, | |||
| categories: categories, | |||
| }, nil | |||
There was a problem hiding this comment.
Missing length validation for
varchar(63) DB constraint
NewClassifier validates empty names, duplicates, and malformed rules, but not the length of cfg.Name or rule Subcategory values. The migration and temp-table DDL both declare these columns as varchar(63). If an operator configures a category name or subcategory string longer than 63 characters, the Lookout ingester's batch UPDATE will fail with a PostgreSQL "value too long for type character varying(63)" error, stalling the ingestion pipeline for that batch.
Add a length guard during classifier construction:
const maxLen = 63
if len(cfg.Name) > maxLen {
return nil, fmt.Errorf("category name %q exceeds maximum length %d", cfg.Name, maxLen)
}
...
for i, r := range cfg.Rules {
if len(r.Subcategory) > maxLen {
return nil, fmt.Errorf("category %q rule %d: subcategory %q exceeds maximum length %d", cfg.Name, i, r.Subcategory, maxLen)
}The same check should cover config.DefaultCategory and config.DefaultSubcategory.
There was a problem hiding this comment.
TODO: ^ - we should fix it in follow up PR straight after this one.
| func CreateSimpleJobFailedEvent(pod *v1.Pod, reason string, debugMessage string, clusterId string, cause armadaevents.KubernetesReason, failureInfo *armadaevents.FailureInfo) (*armadaevents.EventSequence, error) { | ||
| return CreateJobFailedEvent(pod, reason, cause, debugMessage, []*armadaevents.ContainerError{}, clusterId, failureInfo) | ||
| // CreateSimpleJobFailedEvent creates a failed event with no container details and no classification. | ||
| // Use for failures where pod container statuses are unavailable (preemption, submit failures). |
There was a problem hiding this comment.
TODO: think about rejected jobs
masipauskas
left a comment
There was a problem hiding this comment.
LGTM, in follow up PR, we should:
- Add category/subcategory length validations when setting up categoriser from config.
- We should consider how/if in any way we're handling rejected jobs (for now - I think - we just ignore them, for categorisation purpose.
| @@ -45,16 +54,20 @@ func NewClassifier(configs []CategoryConfig) (*Classifier, error) { | |||
| return nil, fmt.Errorf("category %q must have at least one rule", cfg.Name) | |||
| } | |||
| cat := category{name: cfg.Name} | |||
| for i, rule := range cfg.Rules { | |||
| r, err := buildRule(rule) | |||
| for i, r := range cfg.Rules { | |||
| built, err := buildRule(r) | |||
| if err != nil { | |||
| return nil, fmt.Errorf("category %q rule %d: %w", cfg.Name, i, err) | |||
| } | |||
| cat.rules = append(cat.rules, r) | |||
| cat.rules = append(cat.rules, built) | |||
| } | |||
| categories = append(categories, cat) | |||
| } | |||
| return &Classifier{categories: categories}, nil | |||
| return &Classifier{ | |||
| defaultCategory: config.DefaultCategory, | |||
| defaultSubcategory: config.DefaultSubcategory, | |||
| categories: categories, | |||
| }, nil | |||
There was a problem hiding this comment.
TODO: ^ - we should fix it in follow up PR straight after this one.
#4867) <!-- Thanks for sending a pull request! Here are some tips for you: --> #### What type of PR is this? bug-fix / follow-up #### What this PR does / why we need it Follow-up to #4863. PR #4863 added `failure_category varchar(63)` and `failure_subcategory varchar(63)` columns to `job_run` (migration 032), but `NewClassifier` accepted arbitrarily long values from config. An operator setting a category name longer than 63 chars would only discover the problem at the first failed pod, when the lookout ingester's batch UPDATE hit `value too long for type character varying(63)` and stalled the batch. This PR adds length guards at config-load time so the executor refuses to start with bad config rather than silently failing hours later. New constant `maxCategoryNameLen = 63` documents the link to migration 032. `NewClassifier` validates `DefaultCategory`, `DefaultSubcategory`, every `cfg.Name`, and every rule's `Subcategory`. #### Which issue(s) this PR fixes Fixes # #### Special notes for your reviewer Addresses a review follow-up on #4863. A second follow-up from that review (handling rejected jobs for categorization purposes) is intentionally out of scope here - it's a design discussion rather than a mechanical change and belongs in a separate issue/PR. Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it
Reworks how pod failures get categorized in the executor and changes the wire format that carries the result downstream to Lookout.
Classifier: first-match-wins with subcategory
The executor classifier used to return a
[]stringof every category whose rules matched a failed pod. That's hostile to metrics: a single failure counted as two or three categories double-counts in "failures by category" rollups, and there's no stable top-level bucket to alert on.The classifier now returns a single
(category, subcategory). Rules inside a category are evaluated in config order; the first rule that matches wins and contributes its optionalsubcategory. Categories are likewise evaluated in config order. If nothing matches, the classifier returns a configurabledefaultCategory, falling back to the built-in"uncategorized"if one isn't set.The top-level category stays low-cardinality for SLOs and alerts. Subcategory carries the drill-down detail without inflating the bucket count.
Feature flag
Categorization is gated on a new executor config field
enableJobErrorCategorization, defaultfalse. When off, the classifier isn't constructed and no category is emitted, so existing deployments are unchanged until they opt in.Wire format: flat scalars instead of a nested message
armadaevents.Errorused to carry aFailureInfosubmessage withexit_code,termination_message,categories []string, andcontainer_name. Three of those four fields duplicate data that's already onContainerError, and the[]stringshape matched the old classifier.The
FailureInfomessage is removed, field 15 is reserved, and two new scalar fields are added toError:failure_category(16)failure_subcategory(17)The same flattening is applied to the public
api.JobFailedEvent: itscategories []stringis replaced by the two scalars at the same tag numbers.Lookout storage
Migration 032 adds two
textcolumns tojob_run:failure_categoryandfailure_subcategory. The Lookout ingester writes these columns directly instead of packing a jsonbfailure_infoblob. Plain columns are cheaper to update than jsonb and support ordinary indexes if we want to query on them later.The old
failure_infojsonb column is left in place for now so the read path keeps working while it's migrated.Follow-up
failure_infojsonb column once the Lookout read path and UI move to the new columns.job_failure_category_totallabelled by category so the new classification is observable.Which issue(s) this PR fixes
Special notes for your reviewer