Skip to content

Conversation

@Kami
Copy link
Member

@Kami Kami commented Jul 30, 2019

This pull request fixes a bug with ci-integration makefile test target.

That target incorrectly enabled coverage for runners and orquesta integration tests when ENABLE_COVERAGE environment variable was set to no. I randomly noticed then when looking at the integration tests output for a PR.

Coverage adds quite a bit of overhead, so this should speed up PR tests where we don't enable coverage.

New make coverage introduced in #4147 are quite convoluted (at least to me) so I decided to combine various make targets inside a single one - if I didn't do that, I would probably need to add at least 6 or so new targets to get everything to work correctly.

On a related note - we seem to have a lot of unused make targets. Perhaps some clean up is in order.

@Kami Kami requested a review from blag July 30, 2019 19:13
@Kami Kami added this to the 3.2.0 milestone Jul 30, 2019
@Kami
Copy link
Member Author

Kami commented Jul 30, 2019

@blag Per Slack - if you feel adventurous you can also split it into multiple targets.

I found that multi target approach quite convoluted and failed to get it to work in a reasonable amount of time frame so I reverted to a single target approach because I don't have too much time to spend on this (I really just randomly found out this issue and I wanted to fix it since it can speed things up).

@Kami
Copy link
Member Author

Kami commented Jul 30, 2019

After digging into this, I also noticed that "new" Orquesta integration tests are very slow.

With this change, existing integration tests itself take 5-8 minutes out of total of 16 minutes. So that's a huge chunk of integration tests run time (https://gist.github.com/Kami/f4539b6ef980502b8488a067b12720f3):

71 tests run in 569.491 seconds�[32m (71 tests passed)

https://api.travis-ci.org/v3/job/557444949/log.txt

@m4dcoder I assume there must be some "quick wins" aka performance optimizations possible for those tests (aka reduce some expensive setup or similar, I didn't have time to dig in)?

Before those tests were added, they took on average ~12 minutes, now they take 18-22 minutes, or 15-16 with coverage fix from this PR.

For reference, here is a link to "random" build before those tests - https://travis-ci.org/StackStorm/st2/builds/376401081?utm_source=github_status&utm_medium=notification (at that time, integration tests took 6-8 minutes and unit tests were the slowest build job)

@Kami
Copy link
Member Author

Kami commented Jul 30, 2019

As far as fix in this PR goes - it looks like it saves us 3-6 minutes, not great, not terrible, I will take it.

Hopefully with performance optimizations in Orquesta integration tests, we can get integration tests run time down to more reasonable 10-12 minutes again (cumulatively over a longer period this would means tons and tons of saved CPU cycle and developer time waiting on builds).

@Kami
Copy link
Member Author

Kami commented Jul 30, 2019

On a related note - any objections to me removing all the various unused make targets?

@Kami
Copy link
Member Author

Kami commented Jul 30, 2019

2c1d1a5 seems to help a bit (~ 14 minutes vs 16 minutes 40 seconds), but two tests are failing.

https://travis-ci.org/StackStorm/st2/builds/565662789 vs https://travis-ci.org/StackStorm/st2/builds/565637432

I assume that's because they were designed around the idea that initial wait will always be at least 3 seconds and they will probably need to be re-designed a bit.

In addition to that change and fixing the failing tests I propose moving the "biggest offenders" from those tests to a nightly build as discussed with @m4dcoder on Slack (we will need to do the same for Python 3 tox targets and shuffle things around).

How should we handle / organize nightly tests? Perhaps have component/tests/nightly/{unit,integration}/ directory for nightly tests?

This should get us back in time frame of around 10 minutes for integration tests.

@Kami
Copy link
Member Author

Kami commented Jul 31, 2019

I pushed a workaround which should fix the issue with two tests failing now that the global retry delay has been decreased - 8fc1423, bbebd02.

Keep in mind that I didn't really make anything worse or change how things work.

Those two tests already contain a race and relied on timing previously. The difference is that previously all the tests were delayed for up to 2-3 minutes because of the long wait_fixed delay when there was no need for it. Now only the two tests which rely on timing / contain a race use a longer sleep.

Going forward, it would definitely be better to refactor those tests so they don't rely on specific timing of when st2client.executions.get_by_id is called (see my comment in the tests), but this issue was already there before my change and I can't fix everything as part of this PR.

/cc @m4dcoder

@Kami Kami changed the title [WIP] Fix code coverage for ci-integration make tests target Fix code coverage for ci-integration make tests target, speed up some Orquesta integration tests Jul 31, 2019
@Kami Kami requested a review from arm4b July 31, 2019 10:13
@Kami Kami force-pushed the speed_pr_tests branch from 3a6fa73 to 9be8199 Compare July 31, 2019 11:23
@arm4b arm4b requested a review from m4dcoder July 31, 2019 13:11
action_constants.LIVEACTION_STATUS_RUNNING
]

DEFAULT_WAIT_FIXED = 3000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just change this to 1500 default so you don't have to override it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this should actually default to 500, not 3000.

I was testing many things and accidentally added a wrong change.

We should default to the lowest value possible which works for most of the tests and increase it for tests which need a higher value (like the ones which explicitly set it to 1500).

Will again verify 500 works correctly and push the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll pull down the changes later this PM and make a run myself. I still need to understand the changes you made to wait state method. I'll merge this PM if all is ok.

Copy link
Contributor

@m4dcoder m4dcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kami Setting the DEFAULT_WAIT_FIXED = 500 should be sufficient without doing the inner functions and overriding wait_fixed in retrying. The wait_fixed just tells retrying to wait a fix interval between checking state again. So, 1500 just means wait a bit longer vs 500 which checks a few more times. But in the process of testing that, I uncovered a possible bug with orquesta (states flipping between failed and paused). That's probably what the wait_fixed=1500 is masking. I'm currently identifying the root cause in orquesta, which is tedious.

@arm4b arm4b removed their request for review August 1, 2019 13:38
@Kami
Copy link
Member Author

Kami commented Aug 1, 2019

@m4dcoder Yep, that's why I set it to higher for some tests - I assumed there is some internal bug / race, but I didn't have the time and context to track it down so I just went with the "easy" approach (aka what the existing tests did before - just use a long wait time which "masks" / works around the real issue).

It's great to hear that you are on it though and that you potentially identified the real root cause. Using low retry time for all the the tests will allow us to speed those tests up by a lot.

Kami added 5 commits August 1, 2019 19:43
Make sure we only generate coverage for runners and orquesta integration
tests when ENABLE_COVERAGE environment variable is set to "yes".

Coverage adds a lot of overhead so this should speed up PR builds.
invocation basis.

Update tests which have a race / rely on timing to use longer wait time
to avoid failure.
Making the default wait_fixed to 500 is sufficient. The override to 1500 is unnecessary due to a bug in orquesta.
@m4dcoder m4dcoder merged commit ee3f817 into master Aug 2, 2019
@m4dcoder m4dcoder deleted the speed_pr_tests branch August 2, 2019 00:01
@Kami
Copy link
Member Author

Kami commented Aug 2, 2019

Yay, integration tests are now down to 14-15 minutes from 18-22 minutes!

Thanks for wrapping this up and finding and fixing the root cause for the "race issue" in Orquesta.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants