
Conversation

@ephraimbuddy
Contributor

This is an attempt to prevent a file with full test coverage from slipping below 100% coverage.



@ephraimbuddy ephraimbuddy force-pushed the protect-code-coverage-ci branch from 2d22593 to 8ffd3c5 Compare February 3, 2024 18:34
@ephraimbuddy ephraimbuddy force-pushed the protect-code-coverage-ci branch 2 times, most recently from a647e52 to 2114b8b Compare February 20, 2024 14:32
@ephraimbuddy ephraimbuddy marked this pull request as ready for review February 20, 2024 14:32
@ephraimbuddy ephraimbuddy force-pushed the protect-code-coverage-ci branch from 2114b8b to 7cf2756 Compare February 20, 2024 18:27
@potiuk
Member

potiuk commented Feb 21, 2024

This is not likely to succeed. The nature of our tests - where we gather coverage from separate runs and combine it - is such that every single run will get lower coverage than the "combined" one. What we are really after, most likely, is to have the coverage bot comment back (and potentially mark the PR as not-green) when the coverage drops. We used to have such a coverage bot but we disabled it; maybe it's time to re-enable it now that we have started looking at coverage again.

@ephraimbuddy
Contributor Author

This is not likely to succeed. The nature of our tests - where we gather coverage from separate runs and combine it - is such that every single run will get lower coverage than the "combined" one. What we are really after, most likely, is to have the coverage bot comment back (and potentially mark the PR as not-green) when the coverage drops. We used to have such a coverage bot but we disabled it; maybe it's time to re-enable it now that we have started looking at coverage again.

I think the current issue is with the files for DB tests and non-DB tests. I will try a bit more and see if that solves it.

@ephraimbuddy ephraimbuddy force-pushed the protect-code-coverage-ci branch 3 times, most recently from 2492697 to 65952ea Compare February 28, 2024 07:53
@ephraimbuddy
Contributor Author

It looks like it worked @potiuk. A failure will ask the developer to update tests or remove the file from the list of files that are not fully covered. Though I suspect this is an additional headache, I think it will be more useful in that it points to the specific files that need improvement or whose coverage has dropped.

@potiuk
Member

potiuk commented Feb 28, 2024

This is nice.

I like the idea of focusing on the coverage produced by particular test files - since we are doing selective tests, this is the only way to get somewhat close to full coverage. It's not ideal - sometimes, for example, we will get coverage in the API code by running a different test - but this is actually a nice way to find tests that are placed in the wrong folder, and we can figure that out later.

There are, however, a few small problems that I see (but I think we can figure out how to solve them):

  • Some files are now split between DB tests and non-DB tests - which means that we will only get the "full" coverage after both test runs have completed. This is quite a common situation, to be honest.

This is solvable - by producing coverage files, uploading them as artifacts, and then having a separate job which downloads all the coverage files, combines them, and only after that reports success/failure (see the sketch at the end of this comment). Most of your code will work - it could simply be a separate job run after everything else we've done. We already do a very similar thing for "summarize warnings" - it is done for the very same reason: all the tests produce warning files, and then a single job pulls all of them, deduplicates them, and summarizes them.

This also has a very nice effect - individual tests will not be failing. It is scary to see your tests failing - you need to take a look at the output to find out that it is coverage rather than an actual test failure. Having a separate job is way better - that job can be named "Coverage Check" or similar, and when it fails you will immediately know that it's the coverage that failed.

Plus we can make that single job non-fatal - i.e. a failure in coverage will not mean that the whole build has failed (we do that for Quarantined tests) - at least initially, while we are figuring out all the quirks and problems; we can accept the job failing for a while until we find and solve all the imperfections we will see from people running actual PRs. Or initially - just for testing - we could run it without failing at all, only producing a summary, and have a look at how it works for various PRs. Then gradually we could make it fail (but non-fatal) and eventually make a green check a "must" to merge a PR, once we are sure that there are very few false negatives.

  • The information that the coverage dropped, without an easy way to check it, is not very helpful for the PR author

I think it would be good (and it is not a very complex thing) to have a way to show the diff of coverage for the files that have a problem: before / after. I think that can be done nicely by generating the right URL to the remote codecov - when we upload the coverage from the test, there is a way to see it and, I think, even see the diff. Seeing just "coverage dropped" without giving the user a clue how to check it is bad - seeing a URL will help. There is also the GitHub Action from Codecov which we could use.

I think such an approach might be way better than what the Codecov GitHub Action integration provides by default - with their actions they post comments in the PR with a summary etc., but I find that far too spammy. Having a separate job with the "code coverage" information, an indication of the problem (by status), and a list of problem files with a URL to follow is a much better approach.

  • Full vs. expected coverage

I think eventually, rather than requiring full coverage, we could turn this into a coverage-drop check, and that way we could easily increase the number of files we monitor. But that's another story.
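
For illustration, here is a minimal sketch of what the combining step of such a separate job could look like - assuming the per-job coverage data files have already been downloaded as artifacts into a local coverage-artifacts directory, and that FULLY_COVERED is a hypothetical list of files expected to stay at 100% (this is not the actual Airflow script):

    from pathlib import Path

    import coverage

    # Hypothetical list of files that must keep 100% coverage.
    FULLY_COVERED = {"airflow/utils/example_module.py"}

    cov = coverage.Coverage(data_file="combined.coverage")
    # Merge the .coverage data files produced by the separate test jobs.
    cov.combine(data_paths=[str(p) for p in Path("coverage-artifacts").rglob(".coverage*")])
    cov.save()
    cov.load()

    below_full = []
    for path in sorted(cov.get_data().measured_files()):
        if path not in FULLY_COVERED:
            continue
        _, statements, _, missing, _ = cov.analysis2(path)
        if missing:
            pct = 100 * (len(statements) - len(missing)) / len(statements)
            below_full.append(f"{path}: {pct:.1f}% (missing lines: {missing})")

    if below_full:
        print("Files that dropped below 100% coverage:")
        print("\n".join(below_full))
        raise SystemExit(1)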

@ephraimbuddy
Contributor Author

I like the suggestions. From them, we are going to do two things; the third one will be pending.

  1. Separate job for the coverage analysis
    Given what we did in the warnings text job, I believe it's possible, but I would prefer to do it in a separate PR. For now, I will make this non-fatal and think about how to collect the analysis and implement it like the warnings job in a separate PR.

  2. Make the coverage failure understandable by providing URLs for the user to look at and understand where the problem is coming from.
    This can also happen in a separate PR, WDYT?

@ephraimbuddy ephraimbuddy force-pushed the protect-code-coverage-ci branch from 65952ea to 086c109 Compare February 28, 2024 12:47
@ephraimbuddy
Contributor Author

I can't reproduce the static check failure locally. It seems I will have to manually copy from CI and edit locally.

@potiuk
Member

potiuk commented Feb 28, 2024

This can also happen in a separate PR, WDYT?

Yes - if we do it for information only. I think failing the CI tests for users is too troublesome for now. Maybe printing the information about the coverage loss in the output is OK - but not failing the PR.

I can't reproduce the static check failure locally. It seems I will have to manually copy from CI and edit locally.

Looks like a bad merge plus the need to regenerate images. It should be reproducible if you run breeze static-checks --all-files

@potiuk
Member

potiuk commented Feb 28, 2024

I looked a bit closer... and unfortunately I think there are a few problems:

  1. Currently it seems that test failures are ignored (so when pytest exits with an error, the script does not exit with an error); a small sketch of one way to propagate the exit code follows at the end of this comment:

https://github.com/apache/airflow/actions/runs/8080612432/job/22077745306?pr=37152#step:5:7061
https://github.com/apache/airflow/actions/runs/8080612432/job/22077744261?pr=37152#step:6:4899

  
      def do_commit(self, dbapi_connection):
  >       dbapi_connection.commit()
  E       sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
  E       	This probably means the server terminated abnormally
  E       	before or while processing the request.
  E       
  E       (Background on this error at: https://sqlalche.me/e/14/e3q8)

FAILED tests/core/test_impersonation_tests.py::TestImpersonation::test_no_impersonation - sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)

  2. It seems that with the way coverage works now, the flaky impersonation test is triggered very frequently - I checked 4 runs and 3 of them had it. So I think something is not really OK here. It might be because of the way pytest is now run. I think we should fix the exit code, see how often it really fails, and maybe we will have a chance to fix the test?
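
As an illustration only (a hypothetical wrapper, not the actual CI script), the exit-code propagation could look like this: run pytest, do the coverage analysis regardless of the outcome, and then exit with pytest's own return code:

    import subprocess
    import sys

    # Run the test suite; do not raise on failure, just remember the exit code.
    result = subprocess.run(["pytest", "--cov=airflow", "--cov-report=xml", *sys.argv[1:]])

    # ... coverage analysis would run here, whatever the test outcome ...

    # Propagate pytest's exit code so a real test failure still fails the job.
    sys.exit(result.returncode)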

@potiuk
Member

potiuk commented Feb 28, 2024

One more thing. I think that right now the "upload-coverage" flag is not used by the tests at all, from what I see. And the thing is - it was implemented so that individual PRs do not send coverage data to the server, because that is where the real "final" coverage is determined, by combining the coverage from all the tests of a given run. I am not sure if that's still the case - it looks like it's not handled the right way any more. @Taragolis - can you help with it?

BTW. I think the flaky tests might be because the coverage data is collected in memory with cov.collect(). I think that might have a huge side effect on the whole test suite being run. I would say a possibly better way to get the coverage in this case would be to produce the coverage XML file (as it was before) and retrieve the coverage information by parsing the XML - that is probably a less invasive way of doing it. I am a little afraid that enabling coverage for all PRs of all our contributors (including those that run on public runners) might introduce a big flakiness problem.

When the error code is fixed, you should re-run this test with the "use public runners" label to see if that is not something that will keep happening.
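
A minimal sketch of the XML-parsing approach mentioned above, assuming a Cobertura-style coverage.xml produced by pytest --cov --cov-report=xml (file names here are illustrative only):

    import xml.etree.ElementTree as ET

    root = ET.parse("coverage.xml").getroot()

    # Each <class> element carries a per-file line-rate between 0.0 and 1.0.
    per_file = {
        cls.attrib["filename"]: float(cls.attrib["line-rate"]) * 100
        for cls in root.iter("class")
    }

    for filename, pct in sorted(per_file.items()):
        if pct < 100.0:
            print(f"{filename}: {pct:.1f}%")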

@ephraimbuddy
Contributor Author

One more thing. I think that right now the "upload-coverage" flag is not used by the tests at all, from what I see. And the thing is - it was implemented so that individual PRs do not send coverage data to the server, because that is where the real "final" coverage is determined, by combining the coverage from all the tests of a given run. I am not sure if that's still the case - it looks like it's not handled the right way any more. @Taragolis - can you help with it?

Given the function I have as upload-coverage, I think it still works the right way?

    def upload_coverage(self) -> str:
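        # Only canary builds (pushes to apache/airflow main) send coverage to codecov.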
        if self.event_name == "push" and self.head_repo == "apache/airflow" and self.ref == "refs/heads/main":
            return "true"
        return "false"

We only upload coverage when the tests were run on the main branch.

BTW. I think the flaky tests might be because the coverage data is collected in memory with cov.collect(). I think that might have a huge side effect on the whole test suite being run. I would say a possibly better way to get the coverage in this case would be to produce the coverage XML file (as it was before) and retrieve the coverage information by parsing the XML - that is probably a less invasive way of doing it. I am a little afraid that enabling coverage for all PRs of all our contributors (including those that run on public runners) might introduce a big flakiness problem.

When the error code is fixed, you should re-run this test with the "use public runners" label to see if that is not something that will keep happening.

Yeah, the failure needs to be investigated. I will check again and add the label

@ephraimbuddy ephraimbuddy added the use public runners Makes sure that Public runners are used even if commiters creates the PR (useful for testing) label Feb 29, 2024
@ephraimbuddy ephraimbuddy force-pushed the protect-code-coverage-ci branch 2 times, most recently from 14cf57d to 0d08897 Compare February 29, 2024 12:43
@potiuk
Member

potiuk commented Feb 29, 2024

We only upload coverage when the tests were run on the main branch.

Right... Stupid me. I missed that we have the if in the codecov action :)

@ephraimbuddy
Contributor Author

I reran the tests twice; the previously failing impersonation tests now pass. I still couldn't reproduce the static check failure locally, and what the CI asks to be updated doesn't make sense to me.

@ephraimbuddy ephraimbuddy force-pushed the protect-code-coverage-ci branch from 0d08897 to 245ba81 Compare February 29, 2024 17:39
@potiuk
Member

potiuk commented Feb 29, 2024

BTW. The error code from pytest is still not propagated to the exit code of the test script:

This run is failing on the impersonation test - but the result of each test type and job is still green:

[screenshot: summary of test jobs showing green results despite the impersonation failure]

@potiuk
Member

potiuk commented Feb 29, 2024

This is what you see when you unfold the folded Core test results now - the impersonation test is still failing in a number of the runs (but the jobs look green because the exit code is not propagated).

Commit messages from the force-pushed branch:

  • This is an attempt to prevent a file with full test coverage from slipping below 100% coverage.
  • Use multiprocessing for coverage
  • Remove non-db test files from covered when running db tests
  • Remove files that have full coverage from not fully covered files
  • Add color to error message
  • Update hashes
  • Add back run-coverage for hatch
  • Regenerate images
  • Move the reporting out of analyzing and make this non-fatal
  • Report coverage before analysing
  • Fix static checks

@ephraimbuddy ephraimbuddy force-pushed the protect-code-coverage-ci branch from 245ba81 to 48698c6 Compare March 1, 2024 07:35
@potiuk
Member

potiuk commented Mar 1, 2024

Yeah. That's what I was afraid of - the impersonation test now fails in ALL the runs :)

@potiuk
Member

potiuk commented Mar 1, 2024

Yeah. That's what I was afraid of - the impersonation test now fails in ALL the runs :)

But that is good - because before it failed intermittently, and if we find the problem and fix it here, we can be quite certain it is fixed :)

@ephraimbuddy
Contributor Author

But that is good - because before it failed intermittently, and if we find the problem and fix it here, we can be quite certain it is fixed :)

Yes. That's the next thing I will be looking at. Thanks for looking into the tests - it didn't occur to me to even check!

@Taragolis
Contributor

Taragolis commented Mar 1, 2024

Yeah. That's what I was afraid of - the impersonation test now fails in ALL the runs :)

Ahhhh... As far as I remember, this test is a combination of pain and workarounds; in addition, we are testing here against things that either don't work well or are hard to debug in Airflow:

  • Backfill
  • SubDags
  • Impersonation

@potiuk
Member

potiuk commented Mar 1, 2024

Yeah. We might want to remove or simplify the test.

BTW. There is one more thing I noticed - all of the tests in this build take upwards of 25m (some even more than 30m), whereas the Canary builds - which use the "old way" of getting the coverage - are between 12m and 25m (the SQLite tests are where you can see the biggest difference).

Compare the current build with, for example, this one:

https://github.com/apache/airflow/actions/runs/8103881626/job/22149538266

Possibly the reason is the same as for the impersonation failures - I suspect that gathering coverage information the way we run it now takes much more memory; it could also be that test output etc. is buffered (or something like that). One way to check it is to add the "debug CI resources" label on this build and create a regular PR with both the "debug CI resources" and "full tests needed" labels (removing the check so that coverage is not only enabled for canary builds).

Then you have a "regular PR" running coverage and a "new coverage PR" running coverage, and you can compare the two (including resource usage). That might give a clue about the differences.

@potiuk
Member

potiuk commented Mar 1, 2024

Also, using the "use public runners" label and comparing a regular PR with no coverage enabled against one "with coverage" might show some surprises. I think adding coverage slows down the tests (in general - even in the old way) by 20% - 30%, but we need a few test runs to confirm that. That's one of the reasons we have the coverage tests disabled for regular PRs.

We need a few more data points for that, to see if it makes sense to enable coverage for absolutely all PRs. Maybe we can find a way to enable it only selectively (for example when there are core changes) - it depends what our ultimate goal really is here, because if we enable it but don't use it (while paying the cost), that's likely not a good idea.

@Taragolis
Contributor

Yeah. We might want to remove or simplify the test.

Yeah, the test is hacky because we have to change some settings at runtime to allow us to test a very old bug.

@Taragolis
Contributor

We need two counters 🤣

  • Time spent adjusting something for coverage or codecov
  • Time spent fixing the impersonation tests

@potiuk
Member

potiuk commented Mar 1, 2024

We need two counters 🤣

Time spent adjusting something for coverage or codecov
Time spent fixing the impersonation tests

Yes. The second comment on the second answer here might be a nice template: https://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered

@Taragolis
Contributor

BTW, in an attempt to fix some flaky tests we discussed running specific tests isolated from the others, to avoid the unpredictable behaviour and side effects they may add to other tests - e.g. we have a test which raises SIGSEGV 🤣
But as far as I remember, the final decision was not to do that.

In theory there is the pytest-forked package, which allows running individual tests in a forked process; however, there are some caveats (a minimal sketch of its marker is shown after the list below):

  • The project is not maintained anymore, but it works - I've checked locally
  • Because each test runs in a forked process, it is more likely to introduce side effects of its own
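
For illustration, a tiny sketch of how the marker from pytest-forked is used (the plugin is not part of the Airflow suite; this is shown only to illustrate the isolation idea):

    import pytest


    @pytest.mark.forked
    def test_runs_in_a_forked_process():
        # Side effects (signal handlers, global state, even a SIGSEGV)
        # stay in the forked child and cannot poison the other tests.
        assert 1 + 1 == 2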

@potiuk
Member

potiuk commented Mar 1, 2024

We already have quarantined tests for that kind of approach - they are skipped by default when a test is marked with quarantined - and we could have a separate isolated marker in a similar way, but without ignoring the exit code.

@Taragolis
Contributor

Yeah, something like that. It would be nice to run the isolated tests over all the backends and Python versions.
Still no idea how to isolate each of these hacky tests from the others, other than running pytest for each individual test 😵‍💫
We could even disable half of the pytest plugins (including coverage) for this kind of isolated test, to avoid potential side effects.

@Taragolis
Contributor

BTW, we could temporarily mark tests/core/test_impersonation_tests.py as quarantined (not for the first time) and create a task to move it to isolated and implement that.

@potiuk
Member

potiuk commented Mar 1, 2024

BTW, we could temporarily mark tests/core/test_impersonation_tests.py as quarantined (not for the first time) and create a task to move it to isolated and implement that.

Tempting :)

@Taragolis
Contributor

@ephraimbuddy I think it is worthwhile to check this again after we move the impersonation tests to quarantined.

@potiuk
Member

potiuk commented Mar 13, 2024

@ephraimbuddy I think it is worthwhile to check this again after we move the impersonation tests to quarantined.

Agree :)

@github-actions

github-actions bot commented May 1, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label May 1, 2024
@github-actions github-actions bot closed this May 7, 2024