Add flexible failure handling #106

Closed
barisbalic wants to merge 1 commit into que-rb:master from gocardless:flexible-failure-handling

Conversation

@barisbalic
Contributor

Hi Chris,

We've tried to come up with some modest changes that would allow us to implement the different kinds of failure behaviours that we currently use at GoCardless. This pull request contains everything that is necessary to get us going, and we've actually got a Job running on this in production right now.

We've written a gem called que-retry that makes use of these changes. We're hoping to open source it soon, but it's not quite ready. Looking at it might provide better context for our changes, so let us know if you'd like to take a look.

The current changes are:

  • The failure handling code from the work loop has been extracted into a method of its own.
  • A retryable field has been introduced to stop "failed" jobs from being re-run constantly.
  • A retryable argument has been added to Job.
  • The new migration sets retryable to true for all existing jobs. This ensures users with existing jobs will be unaffected.
  • The lock_job query has been updated to only select jobs that are retryable.
  • A failed_at field has been added for introspection/diagnostic purposes.
  • The freeze on Que::SQL has been removed, because we needed to introduce more queries in our que-retry gem.
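
Roughly, the schema changes in the list above amount to something like the following. This is a sketch based on the column names mentioned in this PR; the actual migration in the branch may differ in types, defaults, and how it handles existing rows:

```sql
-- Sketch of the migration described above (illustrative, not the exact DDL).
-- Defaulting retryable to true covers existing jobs, as the migration intends.
ALTER TABLE que_jobs
  ADD COLUMN retryable boolean NOT NULL DEFAULT true,
  ADD COLUMN failed_at timestamptz;

-- The lock_job query's WHERE clause would then gain a condition like:
--   AND retryable = true
```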

The last point is one we've been unsure about; it could be avoided by including one of our queries in Que itself, but by default it wouldn't be used by Que. It would be great to hear how you feel about this, and we'd be happy to show you que-retry. It's not quite finished, but it's pretty close, and will be open-sourced.

Another change that starts to make sense when you look at que-retry is the possibility of passing instances of Jobs to the handlers, rather than hashes. It would end up being a bit cleaner.

One last thing to note is that we have been running our branch in queue-shootout against your master to make sure we haven't had a detrimental impact on performance. So far it's a close match, and it seems we've had essentially no impact.

@chanks
Collaborator

chanks commented Apr 17, 2015

I'll look at this more closely tomorrow, when it's open-source day at work, but is there a reason you decided to go this route rather than inserting failed jobs into a separate table? That feels like it'd be cleaner to me than adding columns and leaving non-retryable jobs in the main job table. I like the idea of allowing failure/error behavior to be more customizable on a per-class basis, though.

Passing job objects themselves to the error handlers is something I had considered, but there will be some errors where it won't be possible to do so (like if a job class is renamed and rows with the old class are left in the DB, which could happen easily), and it felt more reasonable to just always pass hashes rather than sometimes jobs and sometimes hashes.

Anyway, I'll investigate more tomorrow afternoon. Thanks for the PR and the discussion!

@Sinjo

Sinjo commented Apr 17, 2015

We felt this way was slightly simpler, but there really wasn't much in it. With the separate table, we'd change the query to an INSERT/DELETE wrapped in a transaction, which I think would be safe.

@chanks
Collaborator

chanks commented Apr 17, 2015

I just wrote a section for the customizing Que doc on how you might configure certain failed jobs to be moved to a separate table. I actually found writing and implementing it to be helpful in feeling out the problem. So, my various thoughts:

  • I'm not a fan of keeping these jobs that won't be retried in the main table. I worry it'll bloat the queue and its index unnecessarily, and I don't like the idea of complicating the job lock query to handle their presence, even if it doesn't result in a noticeable performance issue most of the time.
  • The enqueue command feels like a bad place to be declaring whether a particular job is retryable or not - that logic feels like it belongs in the job class. If you decide you want to maintain a fork that keeps failed jobs in the same table, that's one thing I'd rethink if I were you.
  • I like the idea of making error handling overridable on the class or job instance level. That would allow for a lot more flexibility, though I'm not sure exactly what use cases you're looking to support. I guess I'm thinking an interface that would allow for things like (pseudocode):
class MyJob < Que::Job
  def handle_error(error)
    case error
    when NetworkError
      # Transient problem, default to exponential backoff technique.
      super
    when HostNotFoundError
      # Permanent problem, don't retry.
      destroy
    else
      # Other unknown problem. Wait one hour for a developer
      # to fix, but if it happens five times, just give up.
      if error_count <= 5
        retry_in(3600)
      else
        # Stash failed jobs somewhere they can be examined later.
        $redis.lpush "uncaught_errors", JSON.dump(@attrs)
      end
    end
  end
end

I think this would all be orthogonal to the error_handler proc, which is meant to hook into Honeybadger or another monitoring system, and which should always be engaged, regardless of the nature of the error.

What do you think? Does this pseudocode address the kinds of things you were wanting to be able to do? I'm of course also interested in any opinions anyone else has on what they'd like to be able to do with retry logic.

@Sinjo

Sinjo commented Apr 29, 2015

Sorry for the delayed response on this. I spent some of last week thinking through the retryable part of the query, but most of it setting up monitoring for Que on our live systems.

We've got 2 use cases so far, both using the failure callback. They're currently sat in a private repo, but I'll get that opened up in the next day or two (pending a bit of cleanup). Those use cases are:

  • A type of job which is executed with at-most-once delivery, reports any errors on failure, and remains in the queue if it fails at any point.
  • A type of job which does (customisable) backoff for a known list of exceptions. Other exceptions report a failure. Once the backoff schedule has been exhausted, the job is either destroyed, or a failure is reported.
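
The backoff behaviour in the second bullet can be sketched roughly like this. The schedule values, method name, and the give-up-on-nil convention are all illustrative here, not que-failure's actual API:

```ruby
# Hypothetical backoff schedule: seconds to wait before each retry attempt.
BACKOFF_INTERVALS = [10, 60, 600, 3600].freeze

# Returns the delay (in seconds) before the next retry, given how many
# times the job has errored so far, or nil once the schedule is exhausted,
# at which point the job would be destroyed or a failure reported.
def backoff_interval(error_count, intervals = BACKOFF_INTERVALS)
  return nil if error_count > intervals.length
  intervals[error_count - 1]
end
```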

As you pointed out, the failure handling here is separate to the global Que.error_handler. The reason we did that is that we don't want exception reports for things we're going to retry. For example, if a webhook isn't sent because of some transient network issue, it should be retried silently (perhaps incrementing a counter in a metrics system).

Job failure/retry

Going back to the retryable flag, a big part of the use case there is to provide the at-most-once semantics for non-retryable jobs. To do that, in the at-most-once job type, we issue an UPDATE query which sets the job as non-retryable before calling the job's run method. This means that if the job fails after being started by one worker, it won't be picked up by another.
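
The at-most-once step described above would look something like the following. This is a sketch assuming Que's que_jobs table and the retryable column from this PR; the exact query que-failure issues may differ:

```sql
-- Mark the job non-retryable before calling its run method, so that if
-- it fails mid-run it won't be picked up again by another worker.
-- (Illustrative; column and parameter names may not match the real code.)
UPDATE que_jobs
SET    retryable = false
WHERE  job_id = $1;
```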

Single vs multiple tables

There's a bit of a tradeoff for that query if you split it up into multiple tables. You have to run it in a transaction so that the INSERT/DELETE become visible at the same time. I kinda prefer the simplicity of the single UPDATE query, though I can see how having two tables would make things clearer.
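
For comparison, the two-table alternative discussed above would need something like this, wrapped in a transaction so the INSERT and DELETE become visible at the same time (table and column names are illustrative):

```sql
-- Atomically move a failed job out of the main queue table.
BEGIN;
INSERT INTO failed_jobs
  SELECT * FROM que_jobs WHERE job_id = $1;
DELETE FROM que_jobs WHERE job_id = $1;
COMMIT;
```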

I'm not too worried about table bloat with jobs that can't be run. For the non-retryable job type, failure is loud (calls a failure handler which in our case sends alerts). The idea there is that failures should be dealt with quickly, and that non-retryable jobs should be rare anyway, with most jobs being written with idempotence in mind.

I'm now going to review the last changes to que-retry before we open it up. I'll reply in here with a link once it's done. Keen to hear your thoughts. I'm sure there are places we can improve it!

@Sinjo

Sinjo commented May 8, 2015

As promised: https://github.com/gocardless/que-failure

Sorry for the delay. We were just tweaking a few last things before opening it up.

@Sinjo

Sinjo commented May 16, 2015

@chanks Wondering if you've had a chance to look through this. We've been using it in production for a few weeks now, and it's all looking stable.

@chanks
Collaborator

chanks commented May 16, 2015

I'm not really feeling any better about the design of adding another column to mark jobs that aren't to be retried and then leaving failed jobs in the same table. It may make sense for your use case, but it doesn't feel like a reasonable general default to me, so I don't think I'll be merging this PR as it is.

I'd like to get to some changes to make error/retry logic more flexible on a class/instance level before I ship 0.11.0 - I think that'll help everyone (including you) out a lot. And then if somebody wants to implement a custom solution that stashes failed jobs elsewhere (in another table, in Redis, emailing them, whatever) that should be easy to support without needing to modify Que itself.

I don't really have a clear vision of what I want those error/retry changes to look like yet, but I imagine I'll be working on it in the next week or two. I skimmed your use case discussion above back when you posted it, but I honestly haven't put a lot of time and thought into what the problems you're facing are and how more flexible error handling can address them, so I was leaving this open until I did that.

@Sinjo

Sinjo commented May 18, 2015

Ah, I'm sorry to hear that, though I understand you've got to weigh these things up in the context of other people using Que.

If it's the difference between getting this merged and maintaining a fork, I'm willing to reconsider the table structure. The objection to the flag on the job table comes as a bit of a surprise, as you seemed to have a preference towards an active flag (which we ended up calling retryable) when we chatted by email.

I understand if you don't want to merge these changes though. If that's the case, we'll carry on with our fork.

@chanks
Collaborator

chanks commented May 18, 2015

Yeah, I brought it up as a possible solution for your use case, but in general I expect people will wind up leaving failed jobs hanging around indefinitely and bloating the jobs table. If you have processes in place to inspect failed jobs and clear them out every once in a while, it's an easier sell, but I doubt most people will expend that effort.


@Sinjo

Sinjo commented May 18, 2015

Ah, that makes more sense. We're adding an audit job to detect any stale non-retryable jobs left in the queue, and notify developers.

If it makes any difference - the default behaviour when running Que is unchanged. Jobs are only marked as non-retryable if you implement a custom failure handler, such as the ones we've released in Que::Failure.

@chanks
Collaborator

chanks commented May 20, 2015

Another thing I'm asking myself regarding this PR is, how many right ways to do this are there? For example, I had the example code for how to do a recurring job in the "customizing que" documentation for a while, since I initially envisioned it as a simple hack on top of Que that would take ten lines and could be easily customized, so we might as well give some example code and then let people tweak it themselves for their exact situation. And then, over time, as people started to use that example and comment on how it could be fixed, I started to feel that recurring jobs were harder to implement correctly than I thought, and that there was one way to implement them that was most correct. So I changed my mind and wrote a Que::RecurringJob class for 1.0 that people can just descend from, and it does most of the work for them.

So far, I'm not convinced that this approach meets that bar. I don't think there's just one correct way to do it (I think people could just as easily want their failed jobs emailed to them or removed entirely as they would keeping them around in the same table). And in the event that somebody does want to safely stash their failed jobs in the database, I think the example I put together in the customizing Que document of how to move a failed job to a separate table is concise and will work just fine, without adding any complexity to the library itself. I might be wrong; we'll see how people use it and respond to it in the future. But for now I think the better answer is to just make the error and retry logic as flexible as possible and see what people want to do with it.


@Sinjo

Sinjo commented Jun 3, 2015

I've done some more thinking on this, and I don't feel like our goals are so far apart. While we chose the two failure handling strategies we put in Que::Failure, our intention was to make the underlying API flexible.

There's no reason other users have to implement handle_job_failure the way we do. By default, they'll get the behaviour Que has always had. If they choose to, they can move the job to a secondary store like Redis, send an email and destroy the job, or store it in a non-runnable state in Postgres and send out an alert as we do.

That hook is the core of the changes we've made, and the bit which opens up these options. The extra migration facilitates the option of storing the job in Postgres in a non-runnable state. We're happy to reconsider what that storage looks like. There are certainly advantages to it being a separate table, and the only downside I can think of is that you need to wrap an INSERT/DELETE in a transaction.

If you'd rather that part lived separately to Que, we can rejig our PR. It'll be a bit tougher on our side as we're in production now, but nothing we can't figure out.

@chanks
Collaborator

chanks commented Jun 4, 2015

Hey, I'm glad. I made an error-handling branch last week and pushed a couple of small additions to support custom retry logic to it (https://github.com/chanks/que/compare/error-handling). The paint's not really dry on it, so I'd love some feedback. I don't know how different it is from what you have in mind. It basically implements the pseudocode I included in my second reply above.

I'm actually a little iffy on how it tries to use the same method to customize retrying and error reporting - the handle_error method can do whatever it wants to destroy or reschedule the job (there's now a retry_in method to specify an amount of time to wait for a retry), and the method's return value dictates whether the error_handler (renamed to error_notifier, for clarity) is called or not. I don't know whether this scheme is elegant or confusing, though. 😛

My plan is to work on it some more on Friday. I'll try to look at your code then and figure out what's missing and if this is a good direction to go in.

@Sinjo
Copy link

Sinjo commented Jul 7, 2015

Sorry about the silence on this. Been working on other things.

I've just looked over the error-handling branch, and the per-job handle_error method (handle_job_failure in our code) is the main reason we created a fork of Que.

Our que-failure gem provides two modules which implement that method, and fit our use case. We kept them separate as we didn't feel they needed to be in Que itself.

I'm still interested in the concept of failed jobs which won't be automatically retried, and where that functionality belongs. In a way, it'd be nice if it were part of Que, and it was just another method you could call. Having it in Que would mean people didn't have to think about subtle tradeoffs like separate tables vs an extra column in the main table. This work can be done in a separate branch. It depends on the per-job failure handling, but doesn't have to be added at the same time.

@chanks
Collaborator

chanks commented Sep 9, 2016

We discussed an alternate system in #147 that was just released in version 0.12.0.
