
Conversation

@potiuk (Member) commented Nov 6, 2023

The recent change splitting the DB tests from non-DB tests likely increased the pressure on parallel-executing DB tests, because they are likely often competing for resources such as DB I/O and networking with other tests running in parallel on the same machine, so the Dask executor tests started to time out (again).

Increasing the timeout might help - if not, we will quarantine the tests and try to solve it differently.



@bolkedebruin (Contributor) commented:

2 min is a long time for testing. And what is the difference with the function above?

@potiuk (Member, Author) commented Nov 6, 2023

> 2 min is a long time for testing. And what is the difference with the function above?

It's long, but we run the tests in parallel with 8 Docker Compose instances running heavy DB tests at the same time - each with its own database - on a single machine where the I/O and especially the kernel are shared between all the tests and databases, and where some of the parallel tests might be resetting and wiping their databases at the same moment. All of this runs on AWS instances that already have a rather slow-ish filesystem, so I/O contention and competition for resources can cause huge delays. A test that normally runs in 2 seconds might sometimes (but only sometimes) run for tens of minutes if the contention affects all of its DB operations - say you have 1000 block writes to disk during the test, then just 10 ms of waiting time per write adds 10 seconds.
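To make the arithmetic at the end of that paragraph concrete, here is a trivial back-of-the-envelope sketch (all numbers are the hypothetical ones from the comment, not measurements):

  # Back-of-the-envelope model of how small per-write I/O delays add up under contention.
  block_writes_per_test = 1000      # hypothetical number of DB block writes in one test
  extra_wait_per_write_s = 0.010    # 10 ms of contention-induced wait per write
  baseline_test_time_s = 2.0        # duration of the same test on an idle machine

  total_time_s = baseline_test_time_s + block_writes_per_test * extra_wait_per_write_s
  print(f"{total_time_s:.1f} s")    # -> 12.0 s: a 2 s test becomes 12 s under mild contention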

This is my guess as to why these tests sometimes run much longer.

Generally we parallelise a lot of those tests, utilising multiple CPUs and a lot of memory; each of the machines runs up to 16 containers during the tests, and they generally utilise 80-90% of the 8 CPUs they have. Unfortunately (even after my recent quest to separate the DB from non-DB tests) we still have a lot (7000) of tests that require the DB, and those tests must be run in a serialised way if they use the same DB (because they use the DB to store and retrieve state, and they usually clean the DB up before and after). And we need to optimise those for "test time", not only CPU - that's why we have a highly optimised setup where we split those tests into several parallel test types, each of them running its DB tests sequentially against its own database (see the sketch below for the general idea). This does not scale linearly (due to the I/O and kernel contention mentioned above), but it scales pretty well (a 2-CPU instance is around 3 times slower than an 8-CPU instance at running the full suite of DB tests).
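Conceptually, the setup described above looks roughly like the sketch below. This is not how Breeze actually orchestrates CI - the group names, paths, and the one-database-per-group connection convention are illustrative assumptions - it just shows the idea of parallel "test types", each run sequentially against its own database:

  # Conceptual sketch only: parallel test groups, each with its own DB, tests inside a group serialized.
  import os
  import subprocess
  from concurrent.futures import ThreadPoolExecutor

  TEST_GROUPS = {          # hypothetical split of DB tests into parallel "test types"
      "core": "tests/core",
      "providers": "tests/providers",
      "www": "tests/www",
  }

  def run_group(name: str, path: str) -> int:
      env = dict(os.environ)
      # Each group gets its own database, so its DB tests can freely reset/wipe state
      # without interfering with the other groups running in parallel.
      env["AIRFLOW__DATABASE__SQL_ALCHEMY_CONN"] = f"postgresql://airflow@localhost/airflow_{name}"
      return subprocess.run(["pytest", path], env=env).returncode

  with ThreadPoolExecutor(max_workers=len(TEST_GROUPS)) as pool:
      exit_codes = dict(zip(TEST_GROUPS, pool.map(run_group, TEST_GROUPS, TEST_GROUPS.values())))
  print(exit_codes)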

Unfortunately, a side effect of this setup is that sometimes, when there is a lot of contention from "noisy neighbours" on the same machine in different container instances, a test that normally takes 10 seconds will suddenly take 20, 30 or even 60 seconds.

Those are usually the more complex tests - they are not really unit tests but end-to-end tests of certain functionality. In this case it is a test that performs a backfill using the Dask executor, so it exercises the whole mechanism of our backfill: it performs all the back-and-forth database reading and writing, runs the scheduler and executor to handle the backfill operation, and so on. This means it can get a lot of contention from parallel tests doing similarly heavy work (if it so happens that the tests running in parallel in different containers are also doing a lot of DB operations at the same time). And that last part is not really deterministic - it varies from machine to machine and from run to run. It's also a pretty well-known fact that not all Amazon instances are equal - some are "better" than others, in terms of being closer or further (in round-trip time) from the DB, having slower memory, or having noisier neighbours on the bare metal (these are VMs, not bare-metal machines, so they might run alongside other VMs on the same hardware). Layers upon layers of abstraction make the "test time" in this particular case pretty unpredictable and wildly varying from run to run.

The timeout increase in this case is our "what is the worst that can happen for the test while we still consider it a success" - and there are cases where we know that a given test will simply sometimes take a long time. In this case I am guessing the test is not hanging but simply taking slightly (or more than slightly) longer than 30 seconds. And there is no way to find out other than... increasing the timeout. If the failures stop happening after the timeout increase, our guess was right. If they continue after the timeout has been increased significantly, we will have to look deeper (I will likely quarantine the test and create an issue, and we will look into addressing it before 2.8 - we try to keep the number of quarantined tests low; we currently have 3 of them).

The increase I usually apply is 3x (this is why it's 100, which is roughly 3x the original 30) - to give enough headroom. It also requires increasing the per-test "hard" timeout to a bit more than that (120), because we currently have 60 s as the default "hard" test timeout for each test. This is so that such a test fails rather than hanging the whole build - if we let the test hang until the whole test job times out (80 minutes now, I believe), the job gets cancelled and we will not know which test caused it. So both 100 and 120 are values that come from experience unflaking other tests, and from experimenting to see if my guess is right. Note - it's not a final value. It's an experiment to see if our hypothesis is right. Once we have run our CI builds quite a few times and we see the problem gone (which, as usual, is what I am planning for), we will be able to see how long those tests take on average when they succeed - whether they sometimes take 30, 60 or 90 seconds. I usually take a random sample from the 10-20 longest-running tests I can find over the last week or so - our test runs print summaries at the end showing the longest-running tests, and from those we can see the variation and the maximum time we can expect for the test.
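For illustration, a minimal sketch of what the "soft" and "hard" timeouts look like in a test, using the standard pytest-timeout plugin. The exact markers, helpers and values in the Airflow test suite may differ; `work_is_done` is a hypothetical stand-in, and 100/120 just mirror the 3x rule described above:

  # Illustrative sketch only - not the exact Airflow configuration.
  import time

  import pytest

  def work_is_done() -> bool:
      # Hypothetical placeholder so the sketch is self-contained; a real test would
      # check executor/backfill state here.
      return True

  @pytest.mark.timeout(120)  # "hard" per-test limit enforced by the pytest-timeout plugin
  def test_sometimes_slow_operation():
      # "Soft" deadline of ~3x the typical worst case, so occasional I/O contention
      # does not turn a slow-but-correct run into a failure.
      deadline = time.monotonic() + 100
      while time.monotonic() < deadline:
          if work_is_done():
              return
          time.sleep(1)
      pytest.fail("operation did not finish within the relaxed timeout")

The "longest running tests" summary mentioned above is the kind of report pytest's built-in `--durations` option produces, e.g. `pytest --durations=20`.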

This is what I usually do when I fight such flaky tests.

I hope the explanation is detailed enough to describe the reasoning behind the change and the root causes of what's happening here.

Is that a good enough explanation? Or maybe (knowing the context) do you have some ideas on how we can improve the process and treat those tests differently? I am all ears @bolkedebruin :).

Really, I would love others to attempt to diagnose and fix those flaky tests, so if there are other ideas to help with them - I am all for reviewing and commenting on PRs and attempts to get rid of those flakes :). I would really, really, really love that.

@potiuk (Member, Author) commented Nov 6, 2023

BTW, it's also possible that this test and the Dask executor integration have a bug or a race condition. And yes, I have a plan for that too.

For now, the hypothesis is that it's just contention - it used to happen in the past, then disappeared after some test optimisations, then appeared again. Increasing the timeout and letting 500 or so jobs run it to see if it still occurs from time to time is simply the easiest way to verify this hypothesis.

If the hypothesis is wrong, then without spending too much time trying to understand the Dask executor details, we will have no choice but to ask the Dask community for help.

We've reached out to the Dask community about Dask executor problems in the past (and sometimes they helped), and we might do it this time as well. This time, however - after moving the Dask executor to a provider - if we cannot quickly solve it ourselves, we have a simple mechanism.

Since the Dask executor is now in a provider, we can simply suspend the provider from being released. We have a process for that; we've done it in the past for yandex and qubole. And if we still have a flakiness problem with it and no solution at hand, I will propose suspending the Dask executor provider and telling the Dask community that without their help it will remain suspended - which will mean:

  • no tests run in main
  • only what's already released as a provider will be usable in the future
  • no more PRs/fixes/dependency updates to the provider (until it's resumed)
  • no more dask extra in the next MINOR Airflow release

And until the flakiness problem is diagnosed and fixed by someone (and 100 or so PRs have run without showing the flakiness), we will not resume the provider.

This will be my plan B if increasing the timeout does not solve the problem. But I hope it will.

@potiuk potiuk requested a review from vincbeck November 6, 2023 21:31
@potiuk (Member, Author) commented Nov 6, 2023

Would love to merge it for the reasons explained above :) ^^

TL;DR: A chance to deal with the flaky backfill test and stabilize our builds :)

@hussein-awala (Member) left a comment:

I agree that 2 min is too long for a test, but if it's to avoid failing the CI job because of this flaky test, there isn't a problem.

@bolkedebruin (Contributor) commented:

Gotcha @potiuk (and that's a summary of all of the above :-P). I'm just confused that we are end-to-end testing this, including the DB. It doesn't seem to make sense to me, but then I would need to dive in.

@potiuk (Member, Author) commented Nov 6, 2023

> Gotcha @potiuk (and that's a summary of all of the above :-P). I'm just confused that we are end-to-end testing this, including the DB. It doesn't seem to make sense to me, but then I would need to dive in.

Yeah, there are likely a few of those tests that we run needlessly. I am not sure the Dask executor is used much at all... And well, maybe one day someone will trace and remove those tests. We have accumulated a lot of them; I am now merely on a quest to keep them running, optimise their execution, track flakes, etc. But if someone would like to improve those - or even remove them... Well, I am fully supportive.

Side comment:

That's why suspending the Dask provider might be a good idea anyway - together with its tests, regardless of the results of this experiment. People will still be able to use the latest released provider if they want, we do not lose anything by disabling the code (and its related tests), and maybe later even removing it.

This is what I think we should start doing more of, starting with Qubole being removed in the next week or so. The way we have providers organised now, we could technically suspend and then later remove ALL providers that we are unsure of, and wait for people to come to us and unsuspend the ones they use and want released.

I am actually seriously considering this approach and being more aggressive with those suspensions (and later removals). The providers will remain released on PyPI and continue working; if there is no maintenance work, we should likely... stop maintaining them and stop releasing them (and eventually remove the code from our codebase). This is what we've discussed before, I believe, and it's now entirely possible; with Qubole it will be tested, and we could generally proceed with it much more aggressively, I think.

@potiuk potiuk merged commit 7b8272b into apache:main Nov 6, 2023
@potiuk potiuk deleted the increase-timeouts-for-daskexecutor branch November 6, 2023 22:10
@potiuk (Member, Author) commented Nov 6, 2023

That is my (not-so) secret plan for how we can remove stuff without going to Airflow 3. I wrote about it on the devlist a few times. If we want to get rid of something without backwards incompatibilities, we can definitely follow that route:

  • move the things we want to remove to a provider (with back-compat PEP 562 dynamic module attributes, as we did with contrib/dask/hive macros - see the sketch below)
  • release the provider a few times
  • suspend the provider
  • remove the provider

It's all fully backwards compatible, but we simply stop maintaining the provider.
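For context, the "dynamic attributes" trick in the first bullet refers to a module-level `__getattr__` (PEP 562). A minimal sketch of that pattern, with hypothetical module and class names:

  # Hypothetical shim module left behind after the real code moves to a provider.
  from __future__ import annotations

  import warnings

  def __getattr__(name: str):
      # PEP 562: called only for attributes not found in this module, so old import paths keep working.
      if name == "SomeExecutor":
          warnings.warn(
              "Importing SomeExecutor from this module is deprecated; "
              "install the provider package and import it from there instead.",
              DeprecationWarning,
              stacklevel=2,
          )
          from airflow.providers.some_provider.executors.some_executor import SomeExecutor

          return SomeExecutor
      raise AttributeError(f"module {__name__!r} has no attribute {name!r}")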

We can do this for a number of things we want to get rid of - for example, AIP-56 (and a few other things) will allow us to eventually remove FAB this way. Maybe not immediately, but it's now quite achievable - with some caveats re: security (cc: @eladkal)

@potiuk (Member, Author) commented Nov 7, 2023

FYI @bolkedebruin - so far so good. I have not seen this flaky test fail on any "self-hosted" runs since yesterday (and I am looking quite regularly).

It looks like my hypothesis about parallel jobs contending for some resources (I/O most likely) was right (and this test is particularly vulnerable):

Usually this test takes 8-19 seconds:

https://github.com/apache/airflow/actions/runs/6783603921/job/18438411043?pr=35492#step:6:2660

  9.97s call     tests/providers/daskexecutor/test_dask_executor.py::TestDaskExecutorQueue::test_dask_queues_no_queue_specified
  8.46s call     tests/providers/daskexecutor/test_dask_executor.py::TestDaskExecutorQueue::test_dask_queues

But sometimes longer:

https://github.com/apache/airflow/actions/runs/6783603921/job/18438410767?pr=35492#step:6:2939

  14.14s call     tests/providers/daskexecutor/test_dask_executor.py::TestDaskExecutor::test_dask_executor_functions
  12.40s call     tests/providers/daskexecutor/test_dask_executor.py::TestDaskExecutorQueue::test_dask_queues
  10.83s call     tests/providers/daskexecutor/test_dask_executor.py::TestDaskExecutorQueue::test_dask_queues_no_queue_specified

And there are cases where they are way longer:

https://github.com/apache/airflow/actions/runs/6783603921/job/18438412609?pr=35492#step:6:3399

  23.37s call     tests/providers/daskexecutor/test_dask_executor.py::TestDaskExecutorQueue::test_dask_queues
  22.87s call     tests/providers/daskexecutor/test_dask_executor.py::TestDaskExecutorQueue::test_dask_queues_no_queue_specified
  22.14s call     tests/providers/daskexecutor/test_dask_executor.py::TestDaskExecutor::test_dask_executor_functions

So I guess increasing the timeout in this case was the right call to decrease the probability of flakiness.

Of course, a better solution would be to make the test less "fragile" - but that is an exercise for someone who understands the Dask integration better and can spend time assessing whether the test can be improved.

Or maybe follow through with the Dask provider removal, which would possibly be the ultimate improvement in stability.

potiuk added a commit to potiuk/airflow that referenced this pull request Nov 7, 2023
Similar to apache#35473 - this test sometimes (very rarely) exceeds the
default 60 seconds timeout when run in the parallel test environment,
where we have multiple containers, multiple databases and multiple
parallel tests running at the same time on a single virtual machine
(in order to maximise the gains from parallelism).

This test sometimes fails at **just** above 60 seconds, so we follow
the 3x rule explained in the long description of the reasoning in
apache#35473 (comment)
potiuk added a commit that referenced this pull request Nov 7, 2023
…35511)

Similar to #35473 - this test sometimes (very rarely) exceeds the
default 60 seconds timeout when run in the parallel test environment,
where we have multiple containers, multiple databases and multiple
parallel tests running at the same time on a single virtual machine
(in order to maximise the gains from parallelism).

This test sometimes fails at **just** above 60 seconds, so we follow
the 3x rule explained in the long description of the reasoning in
#35473 (comment)
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Nov 10, 2023
The recent change splitting the DB tests from non-DB tests
likely increased the pressure on parallel-executing DB tests,
because they are likely often competing for resources such as
DB I/O and networking with other tests running in parallel on the
same machine, so the Dask executor tests started to time out (again).

Increasing the timeout might help - if not, we will quarantine the
tests and try to solve it differently.
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Nov 10, 2023
…pache#35511)

Similar to apache#35473 - this test sometimes (very rarely) exceeds the
default 60 seconds timeout when run in the parallel test environment,
where we have multiple containers, multiple databases and multiple
parallel tests running at the same time on a single virtual machine
(in order to maximise the gains from parallelism).

This test sometimes fails at **just** above 60 seconds, so we follow
the 3x rule explained in the long description of the reasoning in
apache#35473 (comment)
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 18, 2024
…35511)

Similar to #35473 - this test sometimes (very rarely) exceeds the
default 60 seconds timeout when run in the parallel test environment,
where we have multiple containers, multiple databases and multiple
parallel tests running at the same time on a single virtual machine
(in order to maximise the gains from parallelism).

This test sometimes fails at **just** above 60 seconds, so we follow
the 3x rule explained in the long description of the reasoning in
apache/airflow#35473 (comment)

GitOrigin-RevId: 653ff1c5ff6519b98faa30022bd961e6825c8d75