
Conversation

@apilaskowski
Contributor

sternr and others added 13 commits December 5, 2022 20:45
Add retry loop in case where sql query fails, this makes AF much more resilient to potential DB hiccups
Get retry count from config
Changed retryCount config to reuse existing configuration parameter
Migrate the range loop to use the run_with_db_retries method
Fixed bad commit
Fixed bad commit
Removed unnecessary break
Fixed logging string concatenation
@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Feb 28, 2023
@apilaskowski
Contributor Author

@potiuk @kaxil @uranusjr
Please let me know if I should make any changes.

@notatallshaw
Contributor

Small thing, and I'm not sure if Airflow devs already have a policy about this, but time.time() is not guaranteed to be monotonic, so you could report a negative value when you do time.time() - start_time.

You can use time.monotonic() or time.perf_counter() to get monotonic time.
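For illustration, the suggestion amounts to something like the following minimal sketch (the loop body and logger here are placeholders, not the actual scheduler code):

```python
import logging
import time

log = logging.getLogger(__name__)


def run_loop_iteration():
    """Placeholder for one scheduler loop iteration."""
    time.sleep(0.1)


# time.monotonic() never goes backwards, so the computed duration cannot
# be negative even if the system wall clock is adjusted mid-run.
start_time = time.monotonic()
run_loop_iteration()
elapsed = time.monotonic() - start_time
log.info("Loop iteration took %.3f seconds", elapsed)
```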

@apilaskowski
Contributor Author

Small thing, and I'm not sure if Airflow devs already have a policy about this, but time.time() is not guaranteed to be monotonic, so you could report a negative value when you do time.time() - start_time.

You can use time.monotonic() or time.perf_counter() to get monotonic time.

I made a change as @notatallshaw suggested.

@eladkal eladkal added this to the Airflow 2.5.2 milestone Mar 1, 2023
@eladkal eladkal changed the title Making AF more resistant to DB hiccups Add retry to the scheduler loop to protect against DB hiccups Mar 1, 2023
@bjankie1
Contributor

bjankie1 commented Mar 7, 2023

There are multiple db commits in the block and I wonder if the internal state would be messed up if the db fails midway through and the entire block got rerun.

Which multiple commits do you mean? I see the changes as related to the heartbeat alone. Even in the case of a retry, the changes executed in the heartbeat block should be idempotent.

@uranusjr
Member

uranusjr commented Mar 7, 2023

When each create_session block exits, it calls session.commit() and writes to the database. So say if a database failure happens during lines 226–232, the entire block would be restarted and you end up writing the database twice without calling heartbeat_callback correctly. I don’t know if it would be problematic, but it is awkward. Ideally I would imagine the two create_session blocks should be retried separately instead.

(As an aside: I’m not even sure why two create_session blocks are needed here in the first place. Could someone do some git blame detective work and find out if there is a reason behind it?)
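For readers following along, the commit-on-exit behaviour described here can be sketched roughly as follows (a simplified stand-in for Airflow's create_session helper, with an assumed in-memory engine for the example; not the actual implementation):

```python
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Assumed setup for the sketch; Airflow builds these from its own config.
engine = create_engine("sqlite://")
Session = sessionmaker(bind=engine)


@contextmanager
def create_session():
    """Simplified sketch: commit on clean exit, roll back on error."""
    session = Session()
    try:
        yield session
        session.commit()  # leaving the `with` block writes to the database
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
```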

@apilaskowski
Contributor Author

The second session creation was introduced here:
4905a55

@potiuk
Member

potiuk commented Mar 7, 2023

When each create_session block exits, it calls session.commit() and writes to the database. So say if a database failure happens during lines 226–232, the entire block would be restarted and you end up writing the database twice without calling heartbeat_callback correctly. I don’t know if it would be problematic, but it is awkward. Ideally I would imagine the two create_session blocks should be retried separately instead.

@uranusjr Yep. This is very awkward, and this awkwardness is precisely what I am going to change during the AIP-44 implementation. There is a (very unfinished) draft commit, f11f5af, that depends on merging #29776 and will solve this issue by splitting the "before", "during", and "after" work into three steps (and three DB sessions when the internal DB API is not used; when the internal API is used there will be no "during" session at all).

@potiuk DB connection failures happen. Airflow runs in a distributed environment, and as much as we want a reliable infrastructure, that is not possible all the time. The heartbeat is the most frequent operation in the Airflow DB, and its success has a significant impact on task success or failure. What problems do you anticipate with making it more resilient to disruptions?

@bjankie1 Yes. I understand that and sympathise with that statement. But the solution is not to retry a failed transaction without looking at the consequences (as @uranusjr nicely pointed out). Your proposal is a "band-aid" which might create more problems. The right approach to making the system resilient to DB problems is to understand every single DB transaction that happens in the system, deliberately design the behaviour for when that transaction fails, and act appropriately to recover. Retrying a failed transaction without understanding the consequences is a recipe for disaster. And yes, in this case the whole transaction is ... awkward, as @uranusjr also nicely pointed out. The right approach is to fix the behaviour so that it makes more sense (and so you can reason about it), and only after that to implement a recovery scenario, which might actually not be needed: the way I think it will work when I complete the refactor is that these transactions will be split into three (or two in the case of the internal DB API), so there will be no need to recover because the problem you observe will not exist.

@apilaskowski
Contributor Author

apilaskowski commented Mar 22, 2023

@potiuk @uranusjr
What do you think about using @retry_db_transaction, which is already used extensively across Airflow?
I proposed a solution which is better than the previous one.
Now I am not retrying whole blocks of code, which was a bad approach.

What do you think about using @retry_db_transaction for every create_session usage (as proposed in the base_job.run function)?

I hope this is acceptable.
My internal tests suggest that it is enough to cope with a temporarily unavailable DB (for a very short period of time).
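For illustration, a rough sketch of what the proposal could look like (the class and method names here are hypothetical, and the decorator's exact semantics should be checked against airflow.utils.retries):

```python
from airflow.utils import timezone
from airflow.utils.retries import retry_db_transaction
from airflow.utils.session import create_session


class HeartbeatExample:
    """Hypothetical stand-in for the job object, not the actual patch."""

    latest_heartbeat = None

    @retry_db_transaction
    def _merge_heartbeat(self, session):
        # The decorator rolls the session back and re-runs only this small
        # write when the database raises an OperationalError, instead of
        # retrying a whole surrounding block.
        session.merge(self)
        self.latest_heartbeat = timezone.utcnow()

    def heartbeat(self):
        with create_session() as session:
            self._merge_heartbeat(session=session)
```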

# Make the session aware of this object
session.merge(self)
self.latest_heartbeat = timezone.utcnow()
session.commit()
Member

This existed in the old code but I don’t get why it’s needed.

Member

Hmm, the more I think about it, the more problematic this commit seems (I think there's at least one other below). Say the database executes this commit and then hiccups during heartbeat_callback. The entire block would be retried, producing an extra update. That doesn't feel right to me. We should perhaps either

a. remove this commit so the entire block shares one single transaction and gets rolled back and restarted when something bad happens, or
b. split this into two blocks (by the commit call), and retry them separately
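For illustration, option (b) could look roughly like the sketch below; the retry helper and the two block functions are hypothetical placeholders, not the actual BaseJob code:

```python
import logging
import time

from sqlalchemy.exc import OperationalError

from airflow.utils.session import create_session

log = logging.getLogger(__name__)
MAX_DB_RETRIES = 3  # illustrative value; would come from configuration


def run_with_retries(block):
    """Hypothetical helper: re-run `block` in a fresh session on DB errors."""
    for attempt in range(1, MAX_DB_RETRIES + 1):
        try:
            with create_session() as session:
                return block(session)
        except OperationalError:
            log.warning("DB error on attempt %d of %d", attempt, MAX_DB_RETRIES)
            if attempt == MAX_DB_RETRIES:
                raise
            time.sleep(1)


def write_heartbeat(session):
    """Placeholder for the heartbeat update (merge + commit on session exit)."""


def run_heartbeat_callback(session):
    """Placeholder for heartbeat_callback."""


# Option (b): each block gets its own session and commit and is retried
# on its own, so a failure in the callback no longer re-runs the write.
run_with_retries(write_heartbeat)
run_with_retries(run_heartbeat_callback)
```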

Member

Yeah, I share very similar concerns. I think we should be very deliberate about retrying full transactions; there are likely subtle issues that we don't realise we introduce by allowing retries mid-transaction.

Contributor Author

I propose to remove session.commit().

Member

What would be the consequences if you do? Do you know? On what grounds do you propose it?

Contributor Author

I thought I knew, but I went through the code and checked SQLAlchemy to confirm, and I was wrong.
I have reverted this proposal.

Contributor Author

@potiuk, @uranusjr I cannot locate the second commit in heartbeat, unless it is implicit.
What do you think about moving this commit to the end of the block? I thought there was a second commit there (which I must have imagined).
Can there be a commit within the heartbeat_callback function? If so, should we consider excluding this callback from this transaction?

Contributor Author

Another approach I was considering is to make the commit the last operation in handle_db_transaction_with_session and never write to the DB anywhere else inside handle_db_transaction.

This way there is only a single update to the DB, and it happens as the last operation. If anything goes wrong, the DB stays clean at all times. Do you think it would be feasible to have the commit as the last command and forbid committing mid-transaction?

Member

I am actually now working on refactoring this part of the code in #30255, #30302 and #30308 (and one more to follow) in order to handle AIP-44 ( https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44 ). There is already a part there that splits the method into separate steps (#30308), and I think the discussion should happen around the time I do it there, as it will slightly change the behaviour of this particular transaction. (That change is not yet reviewed, so we might have more discussions there; for now I am waiting for #30255 and #30302 to be reviewed and merged, as they are prerequisites for that change.)

@potiuk
Member

potiuk commented Apr 14, 2023

It would be good to take a look at this one after the refactoring is complete. Closing for now - @apilaskowski, please reopen with conflicts resolved if you would like to pursue it. The changes are going to be released in 2.6 and you should base your changes on those.

@potiuk potiuk closed this Apr 14, 2023
@apilaskowski
Contributor Author

Ok. I will.

@ephraimbuddy ephraimbuddy removed this from the Airflow 2.6.1 milestone May 12, 2023