Fix code coverage for ci-integration make tests target, speed up some Orquesta integration tests #4754

Kami · 2019-07-30T19:13:52Z

This pull request fixes a bug with ci-integration makefile test target.

That target incorrectly enabled coverage for runners and orquesta integration tests when ENABLE_COVERAGE environment variable was set to no. I randomly noticed then when looking at the integration tests output for a PR.

Coverage adds quite a bit of overhead, so this should speed up PR tests where we don't enable coverage.

New make coverage introduced in #4147 are quite convoluted (at least to me) so I decided to combine various make targets inside a single one - if I didn't do that, I would probably need to add at least 6 or so new targets to get everything to work correctly.

On a related note - we seem to have a lot of unused make targets. Perhaps some clean up is in order.

Kami · 2019-07-30T19:28:41Z

@blag Per Slack - if you feel adventurous you can also split it into multiple targets.

I found that multi target approach quite convoluted and failed to get it to work in a reasonable amount of time frame so I reverted to a single target approach because I don't have too much time to spend on this (I really just randomly found out this issue and I wanted to fix it since it can speed things up).

Kami · 2019-07-30T19:41:47Z

After digging into this, I also noticed that "new" Orquesta integration tests are very slow.

With this change, existing integration tests itself take 5-8 minutes out of total of 16 minutes. So that's a huge chunk of integration tests run time (https://gist.github.com/Kami/f4539b6ef980502b8488a067b12720f3):

71 tests run in 569.491 seconds�[32m (71 tests passed)

https://api.travis-ci.org/v3/job/557444949/log.txt

@m4dcoder I assume there must be some "quick wins" aka performance optimizations possible for those tests (aka reduce some expensive setup or similar, I didn't have time to dig in)?

Before those tests were added, they took on average ~12 minutes, now they take 18-22 minutes, or 15-16 with coverage fix from this PR.

For reference, here is a link to "random" build before those tests - https://travis-ci.org/StackStorm/st2/builds/376401081?utm_source=github_status&utm_medium=notification (at that time, integration tests took 6-8 minutes and unit tests were the slowest build job)

Kami · 2019-07-30T19:42:54Z

As far as fix in this PR goes - it looks like it saves us 3-6 minutes, not great, not terrible, I will take it.

Hopefully with performance optimizations in Orquesta integration tests, we can get integration tests run time down to more reasonable 10-12 minutes again (cumulatively over a longer period this would means tons and tons of saved CPU cycle and developer time waiting on builds).

Kami · 2019-07-30T20:15:29Z

On a related note - any objections to me removing all the various unused make targets?

Kami · 2019-07-30T20:39:45Z

2c1d1a5 seems to help a bit (~ 14 minutes vs 16 minutes 40 seconds), but two tests are failing.

https://travis-ci.org/StackStorm/st2/builds/565662789 vs https://travis-ci.org/StackStorm/st2/builds/565637432

I assume that's because they were designed around the idea that initial wait will always be at least 3 seconds and they will probably need to be re-designed a bit.

In addition to that change and fixing the failing tests I propose moving the "biggest offenders" from those tests to a nightly build as discussed with @m4dcoder on Slack (we will need to do the same for Python 3 tox targets and shuffle things around).

How should we handle / organize nightly tests? Perhaps have component/tests/nightly/{unit,integration}/ directory for nightly tests?

This should get us back in time frame of around 10 minutes for integration tests.

Kami · 2019-07-31T10:10:00Z

I pushed a workaround which should fix the issue with two tests failing now that the global retry delay has been decreased - 8fc1423, bbebd02.

Keep in mind that I didn't really make anything worse or change how things work.

Those two tests already contain a race and relied on timing previously. The difference is that previously all the tests were delayed for up to 2-3 minutes because of the long wait_fixed delay when there was no need for it. Now only the two tests which rely on timing / contain a race use a longer sleep.

Going forward, it would definitely be better to refactor those tests so they don't rely on specific timing of when st2client.executions.get_by_id is called (see my comment in the tests), but this issue was already there before my change and I can't fix everything as part of this PR.

/cc @m4dcoder

m4dcoder · 2019-07-31T15:24:11Z

st2tests/integration/orquesta/base.py

    action_constants.LIVEACTION_STATUS_RUNNING
 ]

+DEFAULT_WAIT_FIXED = 3000


Why not just change this to 1500 default so you don't have to override it?

Sorry, this should actually default to 500, not 3000.

I was testing many things and accidentally added a wrong change.

We should default to the lowest value possible which works for most of the tests and increase it for tests which need a higher value (like the ones which explicitly set it to 1500).

Will again verify 500 works correctly and push the change.

Ok. I'll pull down the changes later this PM and make a run myself. I still need to understand the changes you made to wait state method. I'll merge this PM if all is ok.

m4dcoder

@Kami Setting the DEFAULT_WAIT_FIXED = 500 should be sufficient without doing the inner functions and overriding wait_fixed in retrying. The wait_fixed just tells retrying to wait a fix interval between checking state again. So, 1500 just means wait a bit longer vs 500 which checks a few more times. But in the process of testing that, I uncovered a possible bug with orquesta (states flipping between failed and paused). That's probably what the wait_fixed=1500 is masking. I'm currently identifying the root cause in orquesta, which is tedious.

Kami · 2019-08-01T17:08:37Z

@m4dcoder Yep, that's why I set it to higher for some tests - I assumed there is some internal bug / race, but I didn't have the time and context to track it down so I just went with the "easy" approach (aka what the existing tests did before - just use a long wait time which "masks" / works around the real issue).

It's great to hear that you are on it though and that you potentially identified the real root cause. Using low retry time for all the the tests will allow us to speed those tests up by a lot.

Make sure we only generate coverage for runners and orquesta integration tests when ENABLE_COVERAGE environment variable is set to "yes". Coverage adds a lot of overhead so this should speed up PR builds.

invocation basis. Update tests which have a race / rely on timing to use longer wait time to avoid failure.

tests.

Making the default wait_fixed to 500 is sufficient. The override to 1500 is unnecessary due to a bug in orquesta.

Kami · 2019-08-02T08:37:49Z

Yay, integration tests are now down to 14-15 minutes from 18-22 minutes!

Thanks for wrapping this up and finding and fixing the root cause for the "race issue" in Orquesta.

Kami requested a review from blag July 30, 2019 19:13

Kami added this to the 3.2.0 milestone Jul 30, 2019

Kami added code coverage tests labels Jul 30, 2019

Kami added the tests: speed optimization label Jul 30, 2019

Kami changed the title ~~[WIP] Fix code coverage for ci-integration make tests target~~ Fix code coverage for ci-integration make tests target, speed up some Orquesta integration tests Jul 31, 2019

Kami requested a review from arm4b July 31, 2019 10:13

Kami force-pushed the speed_pr_tests branch from 3a6fa73 to 9be8199 Compare July 31, 2019 11:23

arm4b requested a review from m4dcoder July 31, 2019 13:11

m4dcoder reviewed Jul 31, 2019

View reviewed changes

m4dcoder requested changes Aug 1, 2019

View reviewed changes

arm4b removed their request for review August 1, 2019 13:38

Kami added 5 commits August 1, 2019 19:43

Fix and combine integration tests targets.

6271014

Make sure we only generate coverage for runners and orquesta integration tests when ENABLE_COVERAGE environment variable is set to "yes". Coverage adds a lot of overhead so this should speed up PR builds.

Try to decrease wait delay, see if that helps.

ad37d68

Add a work around for two tests which rely on longer retry delay.

4d0f12b

Allow wait_fixed and stop_max_delay to be provided on per method

14058b6

invocation basis. Update tests which have a race / rely on timing to use longer wait time to avoid failure.

Default it to a lowest value possible which still works for most of the

dbf7727

tests.

m4dcoder force-pushed the speed_pr_tests branch from c1190a6 to 95e3f59 Compare August 1, 2019 21:16

Revert the wait delay changes to orquesta itests

442641c

Making the default wait_fixed to 500 is sufficient. The override to 1500 is unnecessary due to a bug in orquesta.

m4dcoder force-pushed the speed_pr_tests branch from 95e3f59 to 442641c Compare August 1, 2019 21:19

m4dcoder approved these changes Aug 2, 2019

View reviewed changes

m4dcoder merged commit ee3f817 into master Aug 2, 2019

m4dcoder deleted the speed_pr_tests branch August 2, 2019 00:01

Uh oh!

Fix code coverage for ci-integration make tests target, speed up some Orquesta integration tests #4754

Fix code coverage for ci-integration make tests target, speed up some Orquesta integration tests #4754

Uh oh!

Conversation

Kami commented Jul 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kami commented Jul 30, 2019

Uh oh!

Kami commented Jul 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kami commented Jul 30, 2019

Uh oh!

Kami commented Jul 30, 2019

Uh oh!

Kami commented Jul 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kami commented Jul 31, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

m4dcoder Jul 31, 2019

Choose a reason for hiding this comment

Uh oh!

Kami Jul 31, 2019

Choose a reason for hiding this comment

Uh oh!

m4dcoder Jul 31, 2019

Choose a reason for hiding this comment

Uh oh!

m4dcoder left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Kami commented Aug 1, 2019

Uh oh!

Kami commented Aug 2, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Kami commented Jul 30, 2019 •

edited

Loading

Kami commented Jul 30, 2019 •

edited

Loading

Kami commented Jul 30, 2019 •

edited

Loading

Kami commented Jul 31, 2019 •

edited

Loading

m4dcoder left a comment •

edited

Loading