Add retry to the scheduler loop to protect against DB hiccups #29815
Conversation
Add retry loop in case where sql query fails, this makes AF much more resilient to potential DB hiccups
Get retry count from config
Changed retryCount config to reuse existing configuration parameter
Migrate the range loop to use the run_with_db_retries method
Fixed bad commit
Fixed bad commit
Removed unnecessary break
Fixed logging string concatenation
Small thing, and I'm not sure if Airflow devs already have a policy about this, but time.time() is not guaranteed to be monotonic, so you could report a negative value when you compute time.time() - start_time. You can use time.monotonic() or time.perf_counter() to get monotonic time.
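The suggestion can be illustrated with a minimal sketch (the sleep is just a stand-in for the timed scheduler work):

```python
import time

# time.time() follows the wall clock, which NTP or an operator can step
# backwards, so time.time() - start_time may come out negative.
# time.monotonic() (and time.perf_counter()) can never go backwards,
# so durations computed from them are always >= 0.
start = time.monotonic()
time.sleep(0.01)  # stand-in for the scheduler work being timed
elapsed = time.monotonic() - start
assert elapsed >= 0  # guaranteed, unlike with time.time()
print(f"loop took {elapsed:.3f}s")
```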
I made a change as @notatallshaw suggested.
What multiple commits do you mean? I only see changes related to the heartbeat. Even in the case of a retry, the changes executed in the heartbeat block should be idempotent.
(As an aside: I'm not even sure why two sessions are created here.)
The second creation of a session was introduced here:
@uranusjr Yep. This is very awkward. And this awkwardness is precisely what I am going to change during the AIP-44 implementation. There is a (very unfinished) draft commit: f11f5af, which depends on merging #29776. It will solve this issue by splitting the "before", "during", and "after" task into three steps (and three DB sessions when the internal DB API is not used; when the internal API is used there will be no "during" session at all).
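A rough sketch of what that split might look like (all names here are hypothetical stand-ins for illustration, not the actual AIP-44 code):

```python
from contextlib import contextmanager

steps = []

@contextmanager
def create_session():
    """Hypothetical stand-in for a helper that yields a DB session."""
    yield object()  # a real implementation would yield a SQLAlchemy session

def run_job(job):
    # "before": register the job, in its own short transaction
    with create_session() as session:
        steps.append("before")
    # "during": the actual work; with the internal DB API this step
    # would hold no DB session at all
    steps.append("during")
    # "after": record the outcome, in a separate independent transaction
    with create_session() as session:
        steps.append("after")

run_job("job-1")
print(steps)  # ['before', 'during', 'after']
```

Because each step owns its own session, a hiccup in one step can be recovered without replaying the writes of another.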
@bjankie1 Yes. I understand that and sympathise with the sentiment. But the solution is not to retry a failed transaction without looking at the consequences (pointed out nicely by @uranusjr). Your proposal is a "band-aid" that might create more problems. The right approach to making the system resilient to DB problems is to understand every single DB transaction that happens in the system, deliberately design the behaviour for when that transaction fails, and act appropriately to recover. Retrying a failed transaction without understanding the consequences is a recipe for disaster.

And yes, in this case the whole transaction is... awkward, as also nicely pointed out by @uranusjr. The right approach is to first fix the behaviour so it makes more sense (and you can reason about it), and only after that implement a recovery scenario. That recovery might actually turn out to be unnecessary: the way I think it will work when I complete the refactor, these transactions will be split into three (or two in the case of the internal DB API), so the problem you observe will not exist.
Splitting the retried block into two.
@potiuk @uranusjr What do you think about using @retry_db_transaction for every create_session use (as proposed in the base_job.run function)? I hope that this is acceptable.
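For illustration, this is roughly the shape of such a decorator-based approach; the implementation below is a hypothetical sketch in the spirit of the proposal, not Airflow's actual retry_db_transaction:

```python
import functools

class OperationalError(Exception):
    """Stand-in for a transient error such as sqlalchemy.exc.OperationalError."""

def retry_db_transaction(retries=3):
    """Hypothetical decorator: re-run the wrapped function, which opens
    its own fresh session on each attempt, whenever a transient DB error
    escapes it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except OperationalError:
                    if attempt == retries:
                        raise  # give up after the last attempt
        return wrapped
    return decorator

attempts = {"n": 0}

@retry_db_transaction(retries=3)
def heartbeat():
    # Each attempt would call create_session() anew, so a retry never
    # reuses the session of a failed transaction.
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise OperationalError("deadlock detected")
    return "heartbeat recorded"

print(heartbeat())  # "heartbeat recorded" on the second attempt
```

The key property is that the retried unit is exactly one session's worth of work, which sidesteps the mid-transaction retry concerns raised above.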
# Make the session aware of this object
session.merge(self)
self.latest_heartbeat = timezone.utcnow()
session.commit()
This existed in the old code but I don’t get why it’s needed.
Hmm, the more I think about it, the more problematic this commit seems (I think there's at least one other below). Say the database executes this commit, and then hiccups during heartbeat_callback. The entire block would be retried, producing an extra update. That doesn't feel right to me. We should perhaps either
a. remove this commit so the entire block shares one single transaction and gets rolled back and restarted when something bad happens, or
b. split this into two blocks (by the commit call), and retry them separately
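Option (b) could be sketched roughly like this (the function names are hypothetical, chosen for illustration):

```python
class TransientError(Exception):
    """Stand-in for a transient DB error."""

def retry(func, attempts=3):
    """Minimal retry helper for illustration."""
    for i in range(attempts):
        try:
            return func()
        except TransientError:
            if i == attempts - 1:
                raise

log = []

def persist_heartbeat():
    # Block 1: merge self, set latest_heartbeat, commit.
    log.append("commit latest_heartbeat")

def run_heartbeat_callback():
    # Block 2: heartbeat_callback, retried on its own, so a hiccup here
    # no longer re-executes the update that block 1 already committed.
    log.append("heartbeat_callback")

retry(persist_heartbeat)
retry(run_heartbeat_callback)
print(log)  # ['commit latest_heartbeat', 'heartbeat_callback']
```

Splitting at the commit boundary makes each retried unit match exactly one transaction, which is what makes the retry safe to reason about.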
Yeah. I share very similar concerns. I think we should be very deliberate about retrying full transactions; there are likely subtle issues that we don't realise by allowing retries mid-transaction.
I propose to remove session.commit().
What would be the consequences if you do? Do you know what they are? On what grounds do you propose it?
I thought I did know, but I went through the code and checked SQLAlchemy to confirm, and I was wrong.
I have reverted this proposal.
@potiuk, @uranusjr I cannot locate the second commit in heartbeat, unless it is implicit.
What do you think about moving this commit to the end of the block? I thought there was a second commit there (which I must have imagined).
Can there be a commit within the heartbeat_callback function? If so, should we consider excluding this callback from the transaction?
Another approach I was considering is to make the commit the last operation in handle_db_transaction_with_session and to never write to the DB again from handle_db_transaction.
This way there is only a single update to the DB, and it happens as the last operation; if anything goes wrong before it, the DB stays clean. Do you think it would be feasible to have commit as the last command and forbid using commit mid-transaction?
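A sketch of that idea, with a guard object standing in for the real session (all names here are hypothetical, for illustration only):

```python
class CommitMidTransactionError(RuntimeError):
    pass

class GuardedSession:
    """Hypothetical session wrapper: callers may buffer writes, but an
    explicit commit() mid-transaction is forbidden; the helper issues
    the one and only commit as its final step."""

    def __init__(self):
        self.pending = []
        self.committed = False

    def merge(self, obj):
        self.pending.append(obj)

    def commit(self):
        raise CommitMidTransactionError("commit is only allowed as the final step")

    def _final_commit(self):
        self.committed = True

def handle_db_transaction(work):
    """Hypothetical helper: run ``work`` with a session and commit
    exactly once, as the very last operation, so any failure leaves
    the DB untouched."""
    session = GuardedSession()
    work(session)            # an exception here aborts before any commit
    session._final_commit()  # the single write to the DB
    return session

session = handle_db_transaction(lambda s: s.merge("heartbeat row"))
print(session.committed)  # True
```

With commits funnelled through a single final point, a retry of the whole block never replays an already-committed write.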
I am actually now working on refactoring this part of the code in #30255, #30302 and #30308 (and one more to follow) in order to handle AIP-44 ( https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44 ). There is already a part there that splits the method into separate steps (#30308), and I think the discussion should happen around the time I do it there, as it will slightly change the behaviour of this particular transaction. That change has not yet been reviewed, so we might have more discussion there; for now I am waiting for #30255 and #30302 to be reviewed and merged, as they are prerequisites for that change.
It would be good to take a look at this one after the refactoring is complete. Closing for now. @apilaskowski, please reopen with the conflicts resolved if you would like to pursue it; the changes are going to be released in 2.6 and you should base your changes on those.
Ok. I will.
I used @sternr's https://github.com/apache/airflow/pull/28128 as a base for this PR.