Skip to content

Conversation

@Shide
Copy link

@Shide Shide commented Apr 9, 2024

Added requeue documentation to readme file

MT-5357 @moduon @yajo @rafaelbn @sbidoul @guewen please review if you want 😄

@Shide Shide marked this pull request as draft April 9, 2024 10:14
@Shide Shide force-pushed the 16.0-queue_unlock_jobs-queue_job branch from 11f845e to 3991b80 Compare April 9, 2024 10:21
@Shide Shide marked this pull request as ready for review April 9, 2024 10:25
@rafaelbn
Copy link
Member

rafaelbn commented Apr 9, 2024

@simahawk @gurneyalex you are welcome! 😄

Copy link
Member

@yajo yajo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few typos, but it all looks ok.

@Shide Shide force-pushed the 16.0-queue_unlock_jobs-queue_job branch from 3991b80 to 3c8922b Compare April 9, 2024 23:21
Copy link
Member

@sbidoul sbidoul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you can have multiple Odoo instances running jobs and crons on different machines, there is actually no guarantee that the pid stored on the job is a pid of the machine trying to kill it. So it may be ineffective, or worse, killing an unrelated pid.

So I'm afraid we can't do this.

@yajo
Copy link
Member

yajo commented Apr 11, 2024

Yikes, true. However it's important to understand that we currently have a different problem.

Imagine a job that takes 10 minutes to execute. Maybe because it's slow, or maybe because it's buggy (e.g. a request without a timeout). After 5 minutes, the cron runs and reschedules it. Then it is picked up by another worker. Since the 1st worker still didn't end, the job will run twice. That's a race condition.

We could check the PID is currently running and belongs to an odoo process before terminating it.

Also, we could add this option as a parameter into the cron function directly (False by default). Benefits:

  • Users won't trigger this dangerous behavior by clicking a button.
  • Scenarios where this is safe can still enable that protection. Safe scenarios:
    • Odoo runs inside a container. Visible PIDs will only be those of the container. Not very likely that Postgres or other important processes are there.
    • Odoo runs as a low-privileged user. It won't be able to terminate other users' PIDs, even if it tries to. Own PIDs will most likely be Odoo jobs.

I'd like a better solution, such as being able to check if the job is actually running or not. But then the jobrunner should start 2 threads, one of which would be a keepalive one, or something like that. Way more complex... But do you have any other ideas?

@sbidoul
Copy link
Member

sbidoul commented Apr 11, 2024

But do you have any other ideas?

There was this idea of taking a lock on the job record (#423), so we can know for sure that some worker somewhere is still processing the job. I think it is feasible but it is tricky to get right.

Also, with the current implementation, if you configure the cron so the delay for re-queuing is greater than the CPU time limit of the odoo jobs workers, then you can be sure that the job will have been killed before being requeued.

@Shide
Copy link
Author

Shide commented Apr 11, 2024

With psutil library, we could hash some pid information (pid and create_time at least) and store on the job to ensure we are killing the right. Then we can check the process to ensure is running and is not zombie.

Parameter started_delta=0 passed to the function must be tweaked in every instance to ensure this won't trigger until Odoo has killed his own process (set on the parameters of the environment)

@yajo
Copy link
Member

yajo commented Apr 11, 2024

However, if we keep this in mind:

Also, with the current implementation, if you configure the cron so the delay for re-queuing is greater than the CPU time limit of the odoo jobs workers, then you can be sure that the job will have been killed before being requeued.

Knowing that Odoo will kill the worker after exceeding its allowed time, do we really need to kill it ourselves? Can't we just assume it's being killed and focus on properly rescheduling it?

@Shide
Copy link
Author

Shide commented Apr 11, 2024

However, if we keep this in mind:

Also, with the current implementation, if you configure the cron so the delay for re-queuing is greater than the CPU time limit of the odoo jobs workers, then you can be sure that the job will have been killed before being requeued.

Knowing that Odoo will kill the worker after exceeding its allowed time, do we really need to kill it ourselves? Can't we just assume it's being killed and focus on properly rescheduling it?

Yes, I think is not needed :/

I'm going to update README properly to include job reset configuration

@Shide Shide force-pushed the 16.0-queue_unlock_jobs-queue_job branch from 3c8922b to f0470e3 Compare April 11, 2024 10:21
@Shide Shide changed the title [FIX] [16.0] queue_job: Kill PIDs for requeued stucked jobs [FIX] [16.0] queue_job: Add requeue documentation to README Apr 11, 2024
@Shide Shide requested a review from sbidoul April 11, 2024 10:22
@OCA-git-bot
Copy link
Contributor

This PR has the approved label and has been created more than 5 days ago. It should therefore be ready to merge by a maintainer (or a PSC member if the concerned addon has no declared maintainer). 🤖

@Shide Shide force-pushed the 16.0-queue_unlock_jobs-queue_job branch from f0470e3 to c01eed5 Compare April 15, 2024 07:11
@Shide
Copy link
Author

Shide commented Apr 15, 2024

Done "-1" for using config parameters.
Changed the cron data and returning stuck jobs if anyone wants to do something with those processes.

@sbidoul @guewen @amh-mw

@Shide Shide force-pushed the 16.0-queue_unlock_jobs-queue_job branch 2 times, most recently from 49bee5b to 24f0dcd Compare April 15, 2024 08:01
@Shide Shide requested review from yajo April 15, 2024 08:13
@Shide Shide force-pushed the 16.0-queue_unlock_jobs-queue_job branch from 24f0dcd to cd78484 Compare April 15, 2024 09:20
@Shide
Copy link
Author

Shide commented Apr 15, 2024

All ready

@OCA-git-bot
Copy link
Contributor

This PR has the approved label and has been created more than 5 days ago. It should therefore be ready to merge by a maintainer (or a PSC member if the concerned addon has no declared maintainer). 🤖

Copy link
Member

@rafaelbn rafaelbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Shide !

@Shide
Copy link
Author

Shide commented May 7, 2024

@guewen can you merge this with your bless? 🙌🏻

@simahawk simahawk changed the title [FIX] [16.0] queue_job: Add requeue documentation to README [FIX] [16.0] queue_job: Add requeue default config parameter for started_delta + improve README May 7, 2024
Copy link
Contributor

@simahawk simahawk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sbidoul
Copy link
Member

sbidoul commented May 7, 2024

/ocabot merge patch

Thanks!

@OCA-git-bot
Copy link
Contributor

This PR looks fantastic, let's merge it!
Prepared branch 16.0-ocabot-merge-pr-642-by-sbidoul-bump-patch, awaiting test results.

@OCA-git-bot OCA-git-bot merged commit 436b11c into OCA:16.0 May 7, 2024
@OCA-git-bot
Copy link
Contributor

Congratulations, your PR was merged at dbfd111. Thanks a lot for contributing to OCA. ❤️

@Shide Shide deleted the 16.0-queue_unlock_jobs-queue_job branch May 10, 2024 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants