
Conversation

@apilaskowski
Contributor

sternr and others added 13 commits December 5, 2022 20:45
Add retry loop in case where sql query fails, this makes AF much more resilient to potential DB hiccups
Get retry count from config
Changed retryCount config to reuse existing configuration parameter
Migrate the range loop to use the run_with_db_retries method
Fixed bad commit
Fixed bad commit
Removed unnecessary break
Fixed logging string concatenation
@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Feb 28, 2023
@apilaskowski
Contributor Author

@potiuk @kaxil @uranusjr
Please let me know if I should make any changes.

@notatallshaw
Contributor

Small thing, and I'm not sure if Airflow devs already have a policy about this, but time.time() is not guaranteed to be monotonic, so you could report a negative value when you do time.time() - start_time.

You can use time.monotonic() or time.perf_counter() to get monotonic time.
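For illustration, the suggestion amounts to something like the following minimal sketch (the loop body and logger here are placeholders, not the actual scheduler code):

```python
import logging
import time

log = logging.getLogger(__name__)


def run_loop_iteration():
    """Placeholder for one scheduler loop iteration."""
    time.sleep(0.1)


# time.monotonic() never goes backwards, so the computed duration cannot
# be negative even if the system wall clock is adjusted mid-run.
start_time = time.monotonic()
run_loop_iteration()
elapsed = time.monotonic() - start_time
log.info("Loop iteration took %.3f seconds", elapsed)
```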

@apilaskowski
Contributor Author

Small thing, and I'm not sure if Airflow devs already have a policy about this, but time.time() is not guaranteed to be monotonic, so you could report a negative value when you do time.time() - start_time.

You can use time.monotonic() or time.perf_counter() to get monotonic time.

I made a change as @notatallshaw suggested.

@eladkal eladkal added this to the Airflow 2.5.2 milestone Mar 1, 2023
@eladkal eladkal changed the title Making AF more resistant to DB hiccups Add retry to the scheduler loop to protect against DB hiccups Mar 1, 2023
@bjankie1
Contributor

bjankie1 commented Mar 7, 2023

There are multiple db commits in the block and I wonder if the internal state would be messed up if the db fails midway through and the entire block got rerun.

Which multiple commits do you mean? I see the changes as related to the heartbeat alone. Even in the case of a retry, the changes executed in the heartbeat block should be idempotent.

@uranusjr
Member

uranusjr commented Mar 7, 2023

When each create_session block exits, it calls session.commit() and writes to the database. So say if a database failure happens during lines 226–232, the entire block would be restarted and you end up writing the database twice without calling heartbeat_callback correctly. I don’t know if it would be problematic, but it is awkward. Ideally I would imagine the two create_session blocks should be retried separately instead.

(As an aside: I’m not even sure why two create_session blocks are needed here in the first place. Could someone do some git blame detective work and find out if there is a reason behind it?)
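For readers following along, the commit-on-exit behaviour described here can be sketched roughly as follows (a simplified stand-in for Airflow's create_session helper, with an assumed in-memory engine for the example; not the actual implementation):

```python
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Assumed setup for the sketch; Airflow builds these from its own config.
engine = create_engine("sqlite://")
Session = sessionmaker(bind=engine)


@contextmanager
def create_session():
    """Simplified sketch: commit on clean exit, roll back on error."""
    session = Session()
    try:
        yield session
        session.commit()  # leaving the `with` block writes to the database
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
```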

@apilaskowski
Contributor Author

The second session creation was introduced here:
4905a55

@potiuk
Member

potiuk commented Mar 7, 2023

When each create_session block exits, it calls session.commit() and writes to the database. So say if a database failure happens during lines 226–232, the entire block would be restarted and you end up writing the database twice without calling heartbeat_callback correctly. I don’t know if it would be problematic, but it is awkward. Ideally I would imagine the two create_session blocks should be retried separately instead.

@uranusjr Yep. This is very awkward, and this awkwardness is precisely what I am going to change during the AIP-44 implementation. There is a (very unfinished) draft commit, f11f5af, that depends on merging #29776 and will solve this issue by splitting the "before", "during", and "after" work into three steps (and three DB sessions when the internal DB API is not used; when the internal API is used there will be no "during" session at all).

@potiuk DB connection failures happen. Airflow runs in a distributed environment, and as much as we want a reliable infrastructure, that is not possible all the time. The heartbeat is the most frequent operation in the Airflow DB, and its success has a significant impact on task success or failure. What problems do you anticipate with making it more resilient to disruptions?

@bjankie1 Yes. I understand that and sympathise with that statement. But the solution is not to retry a failed transaction without looking at the consequences (as @uranusjr nicely pointed out). Your proposal is a "band-aid" which might create more problems. The right approach to making the system resilient to DB problems is to understand every single DB transaction that happens in the system, deliberately design the behaviour for when that transaction fails, and act appropriately to recover. Retrying a failed transaction without understanding the consequences is a recipe for disaster. And yes, in this case the whole transaction is ... awkward, as @uranusjr also nicely pointed out. The right approach is to fix the behaviour so that it makes more sense (and so you can reason about it), and only after that to implement a recovery scenario, which might actually not be needed: the way I think it will work when I complete the refactor is that these transactions will be split into three (or two in the case of the internal DB API), so there will be no need to recover because the problem you observe will not exist.

@apilaskowski
Contributor Author

apilaskowski commented Mar 22, 2023

@potiuk @uranusjr
What do you think about using @retry_db_transaction, which is already used extensively across Airflow?
I proposed a solution which is better than the previous one.
Now I am not retrying whole blocks of code, which was a bad approach.

What do you think about using @retry_db_transaction for every create_session usage (as proposed in the base_job.run function)?

I hope this is acceptable.
My internal tests suggest that it is enough to cope with a temporarily unavailable DB (for a very short period of time).
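For illustration, a rough sketch of what the proposal could look like (the class and method names here are hypothetical, and the decorator's exact semantics should be checked against airflow.utils.retries):

```python
from airflow.utils import timezone
from airflow.utils.retries import retry_db_transaction
from airflow.utils.session import create_session


class HeartbeatExample:
    """Hypothetical stand-in for the job object, not the actual patch."""

    latest_heartbeat = None

    @retry_db_transaction
    def _merge_heartbeat(self, session):
        # The decorator rolls the session back and re-runs only this small
        # write when the database raises an OperationalError, instead of
        # retrying a whole surrounding block.
        session.merge(self)
        self.latest_heartbeat = timezone.utcnow()

    def heartbeat(self):
        with create_session() as session:
            self._merge_heartbeat(session=session)
```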

# Make the session aware of this object
session.merge(self)
self.latest_heartbeat = timezone.utcnow()
session.commit()
Member

This existed in the old code but I don’t get why it’s needed.

Member

Hmm, the more I think about it, the more problematic this commit seems (I think there's at least one other below). Say the database executes this commit and then hiccups during heartbeat_callback. The entire block would be retried, producing an extra update. That doesn't feel right to me. We should perhaps either

a. remove this commit so the entire block shares one single transaction and gets rolled back and restarted when something bad happens, or
b. split this into two blocks (by the commit call), and retry them separately
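For illustration, option (b) could look roughly like the sketch below; the retry helper and the two block functions are hypothetical placeholders, not the actual BaseJob code:

```python
import logging
import time

from sqlalchemy.exc import OperationalError

from airflow.utils.session import create_session

log = logging.getLogger(__name__)
MAX_DB_RETRIES = 3  # illustrative value; would come from configuration


def run_with_retries(block):
    """Hypothetical helper: re-run `block` in a fresh session on DB errors."""
    for attempt in range(1, MAX_DB_RETRIES + 1):
        try:
            with create_session() as session:
                return block(session)
        except OperationalError:
            log.warning("DB error on attempt %d of %d", attempt, MAX_DB_RETRIES)
            if attempt == MAX_DB_RETRIES:
                raise
            time.sleep(1)


def write_heartbeat(session):
    """Placeholder for the heartbeat update (merge + commit on session exit)."""


def run_heartbeat_callback(session):
    """Placeholder for heartbeat_callback."""


# Option (b): each block gets its own session and commit and is retried
# on its own, so a failure in the callback no longer re-runs the write.
run_with_retries(write_heartbeat)
run_with_retries(run_heartbeat_callback)
```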

Member

Yeah, I share very similar concerns. I think we should be very deliberate about retrying full transactions; there are likely subtle issues that we don't realise we introduce by allowing retries mid-transaction.

Contributor Author

I propose to remove session.commit().

Member

What would be the consequences if you do? Do you know? On what grounds do you propose it?

Contributor Author

I thought I knew, but I went through the code and checked SQLAlchemy to confirm, and I was wrong.
I have reverted this proposal.

Contributor Author

@potiuk, @uranusjr I cannot locate the second commit in heartbeat, unless it is implicit.
What do you think about moving this commit to the end of the block? I thought there was a second commit there (which I must have imagined).
Can there be a commit within the heartbeat_callback function? If so, should we consider excluding this callback from this transaction?

Contributor Author

Another approach I was considering is to make the commit the last operation in handle_db_transaction_with_session and never write to the DB anywhere else inside handle_db_transaction.

This way there is only a single update to the DB, and it happens as the last operation. If anything goes wrong, the DB stays clean at all times. Do you think it would be feasible to have the commit as the last command and forbid committing mid-transaction?

Member

I am actually now working on refactoring this part of the code in #30255, #30302 and #30308 (and one more to follow) in order to handle AIP-44 ( https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44 ). There is already a part there that splits the method into separate steps (#30308), and I think the discussion should happen around the time I do it there, as it will slightly change the behaviour of this particular transaction. (That change is not yet reviewed, so we might have more discussions there; for now I am waiting for #30255 and #30302 to be reviewed and merged, as they are prerequisites for that change.)

@potiuk
Member

potiuk commented Apr 14, 2023

It would be good to take a look at this one after the refactoring is complete. Closing for now - @apilaskowski, please reopen with conflicts resolved if you would like to pursue it. The changes are going to be released in 2.6 and you should base your changes on those.

@potiuk potiuk closed this Apr 14, 2023
@apilaskowski
Contributor Author

Ok. I will.

@ephraimbuddy ephraimbuddy removed this from the Airflow 2.6.1 milestone May 12, 2023