[10.0][IMP] queue_job: requeue zombie jobs after hard shutdown #423

len-foss · 2022-04-11T12:31:25Z

No description provided.

OCA-git-bot · 2022-04-11T12:31:28Z

Hi @guewen,
some modules you are maintaining are being modified, check this out!

sbidoul · 2022-04-13T20:04:38Z

This is unfortunately not safe, because the job runner may be running on a different Odoo instance (possibly on another machine) than the jobs themselves. So it could be that the runner has crashed but the jobs are actually still running elsewhere.

len-foss · 2022-04-14T07:25:39Z

@sbidoul
Wouldn't it be sufficient to skip locked rows, if the runner is actually working on these?
For now, if the process crashes late at night, you might get a call from the end customer sometime in the morning saying that things are not working, then have to investigate a poorly expressed report before finally fixing it. By then you are well into the work day, your products are not exported, you have 50K+ jobs to run, and you have to deal with transaction rollbacks in the following hours.
There needs to be a solution for this.

sbidoul · 2022-04-14T08:20:49Z

> Wouldn't it be sufficient to skip locked rows, if the runner is actually working on these?

That is an interesting idea. Currently the RunJobController does not keep a lock on job records while executing the job, but we could experiment with reacquiring the lock around here.

simahawk · 2022-04-14T08:24:43Z

isn't this tackled by this cron

queue/queue_job/models/queue_job.py

Line 279 in e0c5096

def requeue_stuck_jobs(self, enqueued_delta=5, started_delta=0):

?

len-foss · 2022-04-14T08:38:12Z

@simahawk
I would assume if there's a need to have requeue_stuck_jobs in production, that means something is horribly wrong that should be fixed instead.
The issue Stéphane mentions does not seem addressed by this method.
Last, it seems to me that to use it you'd have to compute an upper-bound on the time a job is supposed to take, so in case there's an issue slowing down the process this is going to make things much worse.

sbidoul · 2022-04-14T08:49:59Z

@len-foss I wouldn't say horrible :) it's a hard problem for which no one has contributed a complete solution so far.

@simahawk I did not know that cron. For detecting stuck enqueued jobs it is reasonable (although I have seen heavy load situations where 5 seconds is too short for enqueued jobs to start). It requires manual tuning to detect jobs that crashed and were left in started state (and that part is disabled by default, as expected).

So yeah @len-foss's idea looks interesting to me: keep a lock on started jobs, and have something in the job runner loop that resets started jobs on which there is no lock. This new lock might have unexpected effects and backward compatibility implications, not sure, to be tested.
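A minimal sketch of that idea, not the code from this PR. It assumes the job runner has a DB-API cursor on the jobs database and that workers hold a row lock on their `queue_job` row for the whole execution. `reset_unlocked_started_jobs` is a hypothetical helper name, and `FOR UPDATE SKIP LOCKED` requires PostgreSQL 9.5 or later.

```python
# Requeue only the 'started' rows that this session can lock itself.
# Rows locked by a live worker are skipped instead of waited on.
RESET_UNLOCKED_STARTED_JOBS = """
    UPDATE queue_job
       SET state = 'pending'
     WHERE id IN (
           SELECT id
             FROM queue_job
            WHERE state = 'started'
              FOR UPDATE SKIP LOCKED
     )
 RETURNING uuid
"""


def reset_unlocked_started_jobs(cr):
    """Reset 'started' jobs that no worker holds a lock on.

    ``cr`` is any DB-API cursor. Hypothetical helper, not part of
    queue_job. Returns the uuids of the jobs that were reset.
    """
    cr.execute(RESET_UNLOCKED_STARTED_JOBS)
    return [row[0] for row in cr.fetchall()]
```

The runner could call such a helper from its main loop; whether the extra lock has side effects elsewhere is exactly what would need testing.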

len-foss · 2022-04-19T17:05:19Z

@sbidoul
The situation you suggested is actually more complex than the one we have, so we won't really be able to test this patch.

sbidoul · 2022-04-19T17:19:35Z

@len-foss fair enough. Would you create an issue with your idea from #423 (comment) and close this PR ?

len-foss · 2022-04-21T07:13:09Z

@sbidoul
Sorry I was not clear, I already pushed that on this PR, but I don't like the idea of giving code that I can't really test.

sbidoul · 2022-04-21T07:46:14Z

queue_job/controllers/main.py

        http.request.env.cr.commit()

        _logger.debug('%s started', job)
+        job.lock()


Since there is a small window between the commit and here where the job state could be reset, maybe we should do job.lock(expected_state=STARTED) and not exit if the expected state is not correct.
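One possible shape for such a `lock(expected_state=...)` call, sketched as a standalone function over a DB-API cursor rather than as the actual job method. The name `lock_job` and the value of the `STARTED` constant are assumptions for illustration. The query locks the row only if it is still in the expected state, so the caller can tell when the job was reset in between.

```python
STARTED = "started"  # assumed to match the job state value


def lock_job(cr, job_uuid, expected_state=STARTED):
    """Take a row lock on the job if it is still in ``expected_state``.

    Returns True when the row was found and locked, False when the state
    changed in the meantime (for example, the job was requeued right
    after the commit). Hypothetical helper; the lock is held until the
    surrounding transaction ends.
    """
    cr.execute(
        "SELECT state FROM queue_job "
        "WHERE uuid = %s AND state = %s FOR UPDATE",
        (job_uuid, expected_state),
    )
    return cr.fetchone() is not None
```

A caller would check the return value before running the job and decide what to do when it is False.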

sbidoul · 2022-04-21T07:46:58Z

> @sbidoul Sorry I was not clear, I already pushed that on this PR, but I don't like the idea of giving code that I can't really test.

Ah I had not seen your last commit. It looks good to me. So let's keep this PR around until someone has a chance to battle test it.

sebalix · 2022-07-08T06:39:52Z

@len-foss @sbidoul we tested this PR and got a lot of blocked queries on queue_job whenever an exception occurred, such as:

- a normal exception raised during the execution of the job
- a usual "concurrent update" error (that Odoo retries)

Each time we had to kill these blocked queries and requeue the jobs. Without this PR we no longer face any issue.

It could be reproduced again locally with jobs raising an exception I guess, we didn't take time for that.

sbidoul · 2022-07-13T12:32:12Z

@sebalix you actually tested before we had a chance to do it :)

I believe the behavior you observe is because in case of exception we try to write on the job in a different transaction and that creates a deadlock because the main transaction still holds a lock on the job record.

So yeah, this PR is not going to work as is. Resetting to draft.
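The blocking described above can be illustrated with a toy model. This is only an analogy, not queue_job code: a non-reentrant `threading.Lock` stands in for the PostgreSQL row lock, and all function names are made up for the example. The main transaction holds the lock while the job runs, and the error handler then tries to take the same lock through what would be a second transaction.

```python
import threading

# Stand-in for the PostgreSQL row lock on one queue_job record.
job_row_lock = threading.Lock()


def write_failure_in_other_transaction(timeout=0.2):
    """Simulate a second transaction updating the same job row.

    Returns True if the write got the lock, False if it would block.
    """
    acquired = job_row_lock.acquire(timeout=timeout)
    if acquired:
        job_row_lock.release()
    return acquired


def run_job(hold_lock_during_execution):
    """Run a failing job; return whether its failure could be recorded."""
    if hold_lock_during_execution:
        job_row_lock.acquire()  # main transaction keeps the row locked
    try:
        try:
            raise RuntimeError("job failed")
        except RuntimeError:
            # The error handler writes the job state through another
            # transaction while the main one is still open.
            return write_failure_in_other_transaction()
    finally:
        if hold_lock_during_execution:
            job_row_lock.release()
```

With `hold_lock_during_execution=True` the failure write cannot get the lock; with `False` it succeeds. In this toy model the second writer gives up after a timeout, whereas in the reported case the blocked queries had to be killed by hand.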

github-actions · 2022-11-13T12:37:08Z

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days.
If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

[IMP] queue_job: requeue zombie jobs after hard shutdown

057c916

len-foss force-pushed the 10.0-zombie-len branch from c9c6106 to 057c916 Compare April 19, 2022 15:06

sbidoul reviewed Apr 21, 2022

xavier-bouquiaux mentioned this pull request Jun 15, 2022

[14.0][IMP] queue_job: requeue zombie jobs after hard shutdown #439

Closed

sbidoul mentioned this pull request Jul 1, 2022

Automatically reset "Started/Enquired" jobs to "Pedning" on Odoo Start #386

Closed

sbidoul changed the title ~~[IMP] queue_job: requeue zombie jobs after hard shutdown~~ [10.0][IMP] queue_job: requeue zombie jobs after hard shutdown Jul 13, 2022

sbidoul marked this pull request as draft July 13, 2022 12:32

github-actions bot added the stale PR/Issue without recent activity, it'll be soon closed automatically. label Nov 13, 2022

github-actions bot closed this Dec 18, 2022

sbidoul mentioned this pull request Apr 11, 2024

[FIX] [16.0] queue_job: Add requeue default config parameter for started_delta + improve README #642

Merged

sbidoul mentioned this pull request Dec 4, 2024

[IMP] queue_job: detect jobs runned by workers that have been killed #713

Closed

AnizR mentioned this pull request Dec 6, 2024

[16.0][IMP] queue_job: remove dead jobs requeuer cron and automatically requeue dead jobs #716

Merged
