Conversation
|
I'll look at this more closely tomorrow, when it's open-source day at work, but is there a reason you decided to go this route rather than inserting failed jobs into a separate table? That feels cleaner to me than adding columns and leaving non-retryable jobs in the main job table. I like the idea of allowing failure/error behavior to be more customizable on a per-class basis, though. Passing job objects themselves to the error handlers is something I had considered, but there will be some errors where it won't be possible (for example, if a job class is renamed and rows with the old class name are left in the DB, which could happen easily), and it felt more reasonable to always pass hashes rather than sometimes jobs and sometimes hashes. Anyway, I'll investigate more tomorrow afternoon. Thanks for the PR and the discussion!
|
We felt this way was slightly simpler, but there really wasn't much in it. With the separate table, we'd change the query to an …
|
I just wrote a section for the customizing Que doc on how you might configure certain failed jobs to be moved to a separate table. I actually found writing and implementing it to be helpful in feeling out the problem. So, my various thoughts:
```ruby
class MyJob < Que::Job
  def handle_error(error)
    case error
    when NetworkError
      # Transient problem; default to the exponential-backoff technique.
      super
    when HostNotFoundError
      # Permanent problem; don't retry.
      destroy
    else
      # Other unknown problem. Wait one hour for a developer
      # to fix it, but if it happens five times, just give up.
      if error_count <= 5
        retry_in(3600)
      else
        # Stash failed jobs somewhere they can be examined later.
        $redis.lpush "uncaught_errors", JSON.dump(@attrs)
      end
    end
  end
end
```

I think this would all be orthogonal to the error_handler proc, which is meant to hook into Honeybadger or another monitoring system, and which should always be engaged, regardless of the nature of the error. What do you think? Does this pseudocode address the kinds of things you were wanting to be able to do? I'm of course also interested in any opinions anyone else has on what they'd like to be able to do with retry logic.
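To make the "always engaged" part concrete, here's a minimal self-contained sketch of that orthogonal hook, assuming it receives the error and the job's attribute hash. The bare proc and fake job hash below stand in for Que's actual wiring, and the Honeybadger call is only suggested in a comment:

```ruby
notifications = []

# Stand-in for the global error_handler proc: fires for every job error,
# regardless of what the job's own handle_error decided to do.
error_handler = proc do |error, job|
  # In a real app this might be: Honeybadger.notify(error, context: job)
  notifications << [error.message, job[:job_class]]
end

# Simulate the worker invoking the handler after a job raises:
begin
  raise "connection reset"
rescue => e
  error_handler.call(e, { job_class: "MyJob", error_count: 1 })
end

notifications.first # => ["connection reset", "MyJob"]
```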
|
Sorry for the delayed response on this. I spent some of last week thinking through the proposal above.

We've got two use cases so far, both using the failure callback. They're currently sat in a private repo, but I'll get that opened up in the next day or two (pending a bit of cleanup). Those use cases are: …
As you pointed out, the failure handling here is separate from the global …

**Job failure/retry**

Going back to the …

**Single vs multiple tables**

There's a bit of a tradeoff for that query if you split it up into multiple tables. You have to run it in a transaction so that the …

I'm not too worried about table bloat with jobs that can't be run. For the non-retryable job type, failure is loud (it calls a failure handler, which in our case sends alerts). The idea there is that failures should be dealt with quickly, and that non-retryable jobs should be rare anyway, with most jobs being written with idempotence in mind.

I'm going to review the last changes to …
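The transaction point can be sketched in SQL. This is purely illustrative: the `failed_que_jobs` table and the key columns are assumptions, not Que's actual schema. Both statements need to run in one transaction so the job can neither be lost nor picked up again by another worker in between:

```sql
BEGIN;

-- Copy the failed job into the separate table...
INSERT INTO failed_que_jobs
  SELECT * FROM que_jobs
  WHERE priority = $1 AND run_at = $2 AND job_id = $3;

-- ...then remove it from the live queue.
DELETE FROM que_jobs
  WHERE priority = $1 AND run_at = $2 AND job_id = $3;

COMMIT;
```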
|
As promised: https://github.com/gocardless/que-failure. Sorry for the delay; we were just tweaking a few last things before opening it up.
|
@chanks Wondering if you've had a chance to look through this. We've been using it in production for a few weeks now, and it's all looking stable.
|
I'm not really feeling any better about the design of adding another column to mark jobs that aren't to be retried and then leaving failed jobs in the same table. It may make sense for your use case, but it doesn't feel like a reasonable general default to me, so I don't think I'll be merging this PR as it is.

I'd like to get some changes in to make error/retry logic more flexible on a class/instance level before I ship 0.11.0 - I think that'll help everyone (including you) out a lot. And then if somebody wants to implement a custom solution that stashes failed jobs elsewhere (in another table, in Redis, emailing them, whatever), that should be easy to support without needing to modify Que itself. I don't have a clear vision of what I want those error/retry changes to look like yet, but I imagine I'll be working on it in the next week or two.

I skimmed your use case discussion above back when you posted it, but I honestly haven't put a lot of time and thought into what the problems you're facing are and how more flexible error handling can address them, so I was leaving this open until I did that.
|
Ah, I'm sorry to hear that, though I understand you've got to weigh these things up in the context of other people using Que. If it's the difference between getting this merged and maintaining a fork, I'm willing to reconsider the table structure. The objection to the flag on the job table comes as a bit of a surprise, as you seemed to have a preference towards an …

I understand if you don't want to merge these changes, though. If that's the case, we'll carry on with our fork.
|
Yeah, I brought it up as a possible solution for your use case, but in …

On Mon, May 18, 2015 at 12:32 PM, Chris Sinjakli notifications@github.com
|
Ah, that makes more sense. We're adding an audit job to detect any stale non-retryable jobs left in the queue and notify developers.

If it makes any difference, the default behaviour when running Que is unchanged. Jobs are only marked as non-retryable if you implement a custom failure handler, such as the ones we've released in …
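As an illustration only (Que ships no such audit, and every name here is hypothetical), the core check of an audit like that could look like this, run over job rows loaded from the queue table:

```ruby
STALE_AFTER = 24 * 60 * 60 # one day, in seconds; threshold is an assumption

# Given job rows as hashes, return the non-retryable ones whose
# failure happened longer ago than STALE_AFTER.
def stale_failed_jobs(jobs, now: Time.now)
  jobs.select do |job|
    !job[:retryable] &&
      job[:failed_at] &&
      (now - job[:failed_at]) > STALE_AFTER
  end
end

jobs = [
  { job_id: 1, retryable: true,  failed_at: nil },
  { job_id: 2, retryable: false, failed_at: Time.now - 2 * 24 * 60 * 60 },
  { job_id: 3, retryable: false, failed_at: Time.now - 60 },
]

stale_failed_jobs(jobs).map { |j| j[:job_id] } # => [2]
```

A real audit would fetch these rows with a SQL query and hand the stale ones to whatever alerting the failure handlers already use.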
|
Another thing I'm asking myself regarding this PR is, how many right ways …

So far, I'm not convinced that this approach meets that bar. I don't think …

On Mon, May 18, 2015 at 1:01 PM, Chris Sinjakli notifications@github.com
|
Done some more thinking on this. I don't feel like our goals are so far apart. While we chose the two failure handling strategies which we put in …

There's no reason other users have to implement …

That hook is the core of the changes we've made, and the bit which opens up these options. The extra migration facilitates the option of storing the job in Postgres in a non-runnable state. We're happy to reconsider what that storage looks like. There are certainly advantages to it being a separate table, and the only downside I can think of is that you need to wrap an …

If you'd rather that part lived separately from Que, we can rejig our PR. It'll be a bit tougher on our side as we're in production now, but nothing we can't figure out.
|
Hey, I'm glad. I made an …

I'm actually a little iffy on how it tries to use the same method to customize retrying and error reporting - the handle_error method can do whatever it wants to destroy or reschedule the job (there's now a retry_in method to specify an amount of time to wait for a retry), and the method's return value dictates whether the error_handler (renamed to error_notifier, for clarity) is called or not. I don't know whether this scheme is elegant or confusing, though. 😛

My plan is to work on it some more on Friday. I'll try to look at your code then and figure out what's missing and whether this is a good direction to go in.
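A tiny simulation of that return-value scheme, with plain procs standing in for a job's handle_error, the worker loop, and the renamed error_notifier (none of this is Que's actual code):

```ruby
notified = []
error_notifier = proc { |error| notified << error.message }

# Stand-in for a job's handle_error: the return value says whether
# the global notifier should also fire (true) or the error is
# considered fully handled here (false).
handle_error = proc do |error|
  case error.message
  when /transient/ then true   # e.g. rescheduled with retry_in; still notify
  else false                   # e.g. quietly destroyed; skip the notifier
  end
end

# Stand-in for the worker: run the job hook, then maybe the notifier.
run_error_hooks = proc do |error|
  notify = handle_error.call(error)
  error_notifier.call(error) if notify
end

run_error_hooks.call(RuntimeError.new("transient network blip"))
run_error_hooks.call(RuntimeError.new("unknown job class"))

notified # => ["transient network blip"]
```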
|
Sorry about the silence on this. Been working on other things. I've just looked over the …

Our …

I'm still interested in the concept of failed jobs which won't be automatically retried, and where that functionality belongs. In a way, it'd be nice if it were part of Que, and it was just another method you could call. Having it in Que would mean people didn't have to think about subtle tradeoffs like a separate table vs an extra column in the main table.

This work can be done in a separate branch. It depends on the per-job failure handling, but doesn't have to be added at the same time.
|
We discussed an alternate system in #147 that was just released in version 0.12.0. |
Hi Chris,
We've tried to come up with some modest changes that would allow us to implement the different kinds of failure behaviours that we currently use at GoCardless. This pull request contains everything that is necessary to get us going, and we've actually got a job running on this in production right now.
We've written a gem called `que-retry` that makes use of these changes. We're hoping to open-source it soon, but it's not quite ready. Looking at it might provide better context for our changes, so let us know if you'd like to take a look.

The current changes are:
- A `retryable` field has been introduced to stop "failed" jobs from being re-run constantly.
- A `retryable` argument has been added to `Job`.
- The migration sets `retryable` to true for all existing jobs. This ensures users with existing jobs will be unaffected.
- The `lock_job` query has been updated to only select jobs that are `retryable`.
- A `failed_at` field has been added for introspection/diagnostic purposes.
- The `freeze` on `Que::SQL` has been removed, because we needed to introduce more queries in our `que-retry` gem.

The last point is one we've been unsure about; it could be avoided by including one of our queries in `Que` itself, but by default it wouldn't be used by `Que`. It would be great to hear how you feel about this, and we'd be happy to show you `que-retry`. It's not quite finished, but it's pretty close, and will be open-sourced.

Another change that starts to make sense when you look at `que-retry` is the possibility of passing instances of jobs to the handlers, rather than hashes. It would end up being a bit cleaner.

One last thing to note is that we have been running our branch in `queue-shootout` against your master to make sure we've not had a detrimental impact on performance. So far it's a close match, and it seems we've had literally no impact.
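For context, the schema side of the changes above could be sketched as a migration like the following. This is an illustrative reconstruction from the list, not the PR's actual DDL; the types and defaults are assumptions:

```sql
-- New columns on the main jobs table: retryable defaults to true so
-- existing jobs are unaffected, failed_at records when a job was
-- marked non-retryable.
ALTER TABLE que_jobs
  ADD COLUMN retryable boolean NOT NULL DEFAULT true,
  ADD COLUMN failed_at timestamptz;

-- The lock_job query would then additionally filter on the new column,
-- along the lines of: ... WHERE run_at <= now() AND retryable ...
```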