Fix code coverage for ci-integration make tests target, speed up some Orquesta integration tests #4754
Conversation
@blag Per Slack - if you feel adventurous you can also split it into multiple targets. I found the multi target approach quite convoluted and failed to get it to work in a reasonable amount of time, so I reverted to the single target approach because I don't have too much time to spend on this (I really just randomly found this issue and wanted to fix it since it can speed things up).
After digging into this, I also noticed that the "new" Orquesta integration tests are very slow. With this change, those tests alone take 5-8 minutes out of a total of 16 minutes, so they are a huge chunk of the integration tests run time (https://gist.github.com/Kami/f4539b6ef980502b8488a067b12720f3): 71 tests run in 569.491 seconds (71 tests passed) - https://api.travis-ci.org/v3/job/557444949/log.txt @m4dcoder I assume there must be some "quick wins" aka performance optimizations possible for those tests (e.g. reducing some expensive setup or similar; I didn't have time to dig in)? Before those tests were added, the integration test jobs took on average ~12 minutes; now they take 18-22 minutes, or 15-16 with the coverage fix from this PR. For reference, here is a link to a "random" build from before those tests were added - https://travis-ci.org/StackStorm/st2/builds/376401081?utm_source=github_status&utm_medium=notification (at that time, integration tests took 6-8 minutes and unit tests were the slowest build job).
As far as the fix in this PR goes - it looks like it saves us 3-6 minutes; not great, not terrible, I will take it. Hopefully with performance optimizations in the Orquesta integration tests we can get the integration tests run time back down to a more reasonable 10-12 minutes (cumulatively, over a longer period, this would mean tons of saved CPU cycles and developer time spent waiting on builds).
On a related note - any objections to me removing all the various unused make targets?
2c1d1a5 seems to help a bit (~14 minutes vs 16 minutes 40 seconds), but two tests are failing: https://travis-ci.org/StackStorm/st2/builds/565662789 vs https://travis-ci.org/StackStorm/st2/builds/565637432. I assume that's because they were designed around the idea that the initial wait will always be at least 3 seconds, and they will probably need to be re-designed a bit. In addition to that change and fixing the failing tests, I propose moving the "biggest offenders" from those tests to a nightly build, as discussed with @m4dcoder on Slack (we will need to do the same for the Python 3 tox targets and shuffle things around). How should we handle / organize nightly tests? This should get us back to a time frame of around 10 minutes for the integration tests.
I pushed a workaround which should fix the issue with the two tests failing now that the global retry delay has been decreased - 8fc1423, bbebd02. Keep in mind that I didn't really make anything worse or change how things work - those two tests already contained a race and relied on timing previously. The difference is that previously all the tests were delayed for up to 2-3 minutes because of the long default retry delay. Going forward, it would definitely be better to refactor those tests so they don't rely on specific timing. /cc @m4dcoder
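To illustrate the approach (this is a minimal sketch, not the actual st2 test code - the helper name, status values, and timeout are assumptions): the default polling interval is lowered for everyone, and the two racy tests pass a longer interval explicitly instead of slowing every other test down.

```python
import time

DEFAULT_WAIT_FIXED = 500  # milliseconds between status checks (the new, faster default)


def wait_for_status(get_status, expected, wait_fixed=DEFAULT_WAIT_FIXED, timeout=300):
    """Poll get_status() until it returns a value in `expected` or `timeout` seconds pass."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status in expected:
            return status
        time.sleep(wait_fixed / 1000.0)
    raise AssertionError("Timed out waiting for one of: %s" % (expected,))


# Most tests use the fast default; the two tests which raced on the old
# 3 second delay can opt back into a longer interval on a per-call basis:
#   wait_for_status(get_status, {"paused"}, wait_fixed=1500)
```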
    action_constants.LIVEACTION_STATUS_RUNNING
]

DEFAULT_WAIT_FIXED = 3000
Why not just change the default to 1500 so you don't have to override it?
Sorry, this should actually default to 500, not 3000. I was testing many things and accidentally added the wrong change. We should default to the lowest value which works for most of the tests and increase it for the tests which need a higher value (like the ones which explicitly set it to 1500). I will verify again that 500 works correctly and push the change.
Ok. I'll pull down the changes later this PM and make a run myself. I still need to understand the changes you made to the wait state method. I'll merge this PM if all is ok.
@Kami Setting DEFAULT_WAIT_FIXED = 500 should be sufficient without doing the inner functions and overriding wait_fixed in retrying. The wait_fixed just tells retrying to wait a fixed interval before checking state again, so 1500 just means waiting a bit longer between checks, while 500 checks a few more times. But in the process of testing that, I uncovered a possible bug in orquesta (states flipping between failed and paused). That's probably what the wait_fixed=1500 is masking. I'm currently identifying the root cause in orquesta, which is tedious.
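For reference, here is a rough sketch of what that retrying-based polling looks like - assuming the retrying package, with made-up helper and status names rather than the actual st2 test helpers:

```python
from retrying import retry

DEFAULT_WAIT_FIXED = 500          # ms to sleep between status checks
DEFAULT_STOP_MAX_DELAY = 300000   # give up after 5 minutes (an assumed cap)

COMPLETED_STATUSES = {"succeeded", "failed", "timeout", "canceled"}


def _still_running(status):
    # retry_on_result: retrying calls this with the decorated function's
    # return value and retries for as long as it returns True.
    return status not in COMPLETED_STATUSES


@retry(retry_on_result=_still_running,
       wait_fixed=DEFAULT_WAIT_FIXED,
       stop_max_delay=DEFAULT_STOP_MAX_DELAY)
def wait_for_completion(get_execution, execution_id):
    # Each attempt re-reads the execution and returns its status; retrying
    # sleeps wait_fixed milliseconds between attempts. Lowering wait_fixed
    # from 3000 to 500 only changes how often the state is checked, not the
    # overall timeout or the semantics of the check.
    return get_execution(execution_id).status
```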
@m4dcoder Yep, that's why I set it higher for some tests - I assumed there was some internal bug / race, but I didn't have the time and context to track it down, so I just went with the "easy" approach (aka what the existing tests did before - just use a long wait time which "masks" / works around the real issue). It's great to hear that you are on it, though, and that you may have identified the real root cause. Using a low retry time for all the tests will allow us to speed those tests up by a lot.
Make sure we only generate coverage for runners and orquesta integration tests when ENABLE_COVERAGE environment variable is set to "yes". Coverage adds a lot of overhead so this should speed up PR builds.
Decrease the default retry delay and allow overriding it on a per invocation basis. Update tests which have a race / rely on timing to use a longer wait time to avoid failure.
Force-pushed from c1190a6 to 95e3f59.
Making the default wait_fixed 500 is sufficient. The override to 1500 is unnecessary; it was only masking a bug in orquesta.
Force-pushed from 95e3f59 to 442641c.
Yay, integration tests are now down to 14-15 minutes from 18-22 minutes! Thanks for wrapping this up and for finding and fixing the root cause of the "race issue" in Orquesta.
This pull request fixes a bug with the ci-integration makefile test target. That target incorrectly enabled coverage for runners and orquesta integration tests even when the ENABLE_COVERAGE environment variable was set to "no". I randomly noticed this when looking at the integration tests output for a PR. Coverage adds quite a bit of overhead, so this should speed up PR tests where we don't enable coverage.
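Conceptually, the gating amounts to the sketch below. The actual fix lives in the Makefile targets, so this Python snippet only illustrates the intended logic; the test path and nose coverage flags shown are illustrative assumptions, not necessarily the exact ones the targets use.

```python
import os
import subprocess

cmd = ["nosetests", "-s", "-v", "st2tests/integration/orquesta"]

# Only pay the coverage overhead when it is explicitly requested; previously
# the coverage flags were added even when ENABLE_COVERAGE was set to "no".
if os.environ.get("ENABLE_COVERAGE") == "yes":
    cmd += ["--with-coverage", "--cover-erase", "--cover-branches"]

subprocess.call(cmd)
```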
The new make coverage targets introduced in #4147 are quite convoluted (at least to me), so I decided to combine the various make targets into a single one - if I hadn't done that, I would probably need to add at least six or so new targets to get everything to work correctly.
On a related note - we seem to have a lot of unused make targets. Perhaps some clean up is in order.