Skip to content

Comments

[BANKINT-231] Add failed label to gauge#54

Merged
ahjmorton merged 1 commit intomasterfrom
BANKINT-231-add-failed-job-guage
Dec 17, 2019
Merged

[BANKINT-231] Add failed label to gauge#54
ahjmorton merged 1 commit intomasterfrom
BANKINT-231-add-failed-job-guage

Conversation

@ahjmorton
Copy link

Ticket

Add a failed label to queue view metrics to allow monitoring / alerting on failed jobs.

Approach

Initially the idea here was to have separate gauge with its own SQL query to retrieve information.
However by having a single gauge with multiple labels we do get some useful properties :

  • We can aggregate at either the job class or the queue level using sum in Promethesus.
  • We do not have another SQL query to execute, so we shouldn't see performance changes.

To identify a job as failed we look for jobs where error count is greater than 0 and have the retryable flag set to false. This is in line with our identification of failed jobs elsewhere.

QUEUE_VIEW_SQL = <<~SQL
select queue, job_class, priority
, (case when (retryable AND run_at < now()) then 'true' else 'false' end) as due
, (case when (NOT retryable AND error_count > 0) then 'true' else 'false' end) as failed
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could do (NOT retryable AND failed_at IS NOT NULL) I opted for this approach as failed_at isn't updated anywhere in this repository : it is updated over in the que-failure gem.

Given the difference in approaches / legacy here I can't be certain which one is more correct so any comments would be useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This project really needs some tlc. Ideally que-failure would be part of this gem, and we'd clean out all the rest of the crap. Given it's not, just assume we only ever use it with que-failure for now

Copy link
Contributor

@lawrencejones lawrencejones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this looks fine. Let's get it on staging and see

@ahjmorton ahjmorton merged commit 6a9c57d into master Dec 17, 2019
@ahjmorton ahjmorton deleted the BANKINT-231-add-failed-job-guage branch December 17, 2019 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants