Conversation

@Kami (Member) commented Jan 21, 2019

After spending a lot of time on Saturday and today, I believe I have finally tracked down the issue with the very slow tests on Travis.

It turns out that on Travis, RabbitMQ throws a non-fatal Channel.open: (504) CHANNEL_ERROR - second 'channel.open' seen error when using our connection retry wrapper code, which causes the code to "hang" and sleep for 10 seconds here: https://github.com/StackStorm/st2/blob/master/st2common/st2common/transport/connection_retry_wrapper.py#L130 (the issue can be reproduced 100% of the time on Travis).
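For reference, the retry wrapper follows a pattern roughly like the sketch below. This is an illustrative reconstruction, not the actual st2 code; the class name, method name, and the fixed 10-second delay are assumptions based on the linked file:

```python
import eventlet


class ConnectionRetryWrapper(object):
    """Minimal sketch of the retry-wrapper pattern described above
    (illustrative only, not the actual st2 implementation)."""

    def __init__(self, retries=5, logger=None):
        self._retries = retries
        self._logger = logger

    def run(self, func, *args, **kwargs):
        retries_left = self._retries
        while True:
            try:
                return func(*args, **kwargs)
            except Exception:
                retries_left -= 1
                if retries_left <= 0:
                    raise
                # A recoverable server error (such as the non-fatal
                # CHANNEL_ERROR seen on Travis) lands here. Without a
                # log line, the process silently stalls for the whole
                # delay before retrying.
                eventlet.sleep(10)
```

Each occurrence of the non-fatal CHANNEL_ERROR therefore costs a full sleep interval, which is exactly what inflates the Travis run times shown below.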

This happens very often, so tests are up to 20 times slower (and sometimes more) on Travis than locally.

For example:

Local VM:

(virtualenv) vagrant@ubuntu-xenial:/data/stanley$ nosetests --with-timer --rednose --immediate -sv st2actions/tests/unit/test_scheduler.py
test_create_from_liveaction (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_failed_next_item (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_garbage_collection (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_next_execution (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_next_executions_empty (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_no_double_entries (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_no_processing_of_non_requested_actions (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_processing_when_task_completed (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed

TEST RESULT OUTPUT:

[success] 19.82% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_processing_when_task_completed: 1.3436s
[success] 19.79% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_no_processing_of_non_requested_actions: 1.3413s
[success] 14.81% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_failed_next_item: 1.0039s
[success] 12.30% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_next_execution: 0.8336s
[success] 11.67% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_garbage_collection: 0.7914s
[success] 10.53% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_no_double_entries: 0.7141s
[success] 8.45% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_next_executions_empty: 0.5731s
[success] 2.63% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_create_from_liveaction: 0.1780s
-----------------------------------------------------------------------------
8 tests run in 8.990 seconds (8 tests passed)

Travis:

(virtualenv) travis@travis-job-4fb692bd-c140-4460-b3c1-76ca77e1819b:~/build/StackStorm/st2$ nosetests --with-timer --rednose --immediate -sv st2actions/tests/unit/test_scheduler.py
test_create_from_liveaction (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_failed_next_item (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_garbage_collection (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_next_execution (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_next_executions_empty (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_no_double_entries (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_no_processing_of_non_requested_actions (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_processing_when_task_completed (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed

TEST RESULT OUTPUT:

[success] 64.62% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_no_processing_of_non_requested_actions: 103.8978s
[success] 8.10% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_next_execution: 13.0182s
[success] 8.08% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_failed_next_item: 12.9914s
[success] 8.00% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_garbage_collection: 12.8685s
[success] 7.98% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_processing_when_task_completed: 12.8290s
[success] 1.49% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_no_double_entries: 2.3939s
[success] 1.22% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_next_executions_empty: 1.9665s
[success] 0.51% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_create_from_liveaction: 0.8142s
-----------------------------------------------------------------------------
8 tests run in 162.401 seconds (8 tests passed)

I haven't been able to track down the root cause yet, i.e. why RabbitMQ throws that error on Travis, but not locally or in other environments. The good news is that the workaround should be safe to use and appears to have no negative side effects. I tried things such as installing the latest and previous RabbitMQ server versions on Travis, but nothing made a difference. That could indicate the issue is related to the unique networking setup on Travis VMs or something similar.

It appears that this issue has been present on Travis for quite a long time, but it recently got so bad that we couldn't even get tests to pass due to the timing issues.

The workaround for the issue is available in 56b216b, da9177a.

I also have some other "related" PRs open, like switching from Precise to Xenial on Travis, etc., which I plan to finish now that this fix is in place.

@Kami Kami added this to the 3.0.0 milestone Jan 21, 2019
st2common/st2common/transport/connection_retry_wrapper.py:

raise

# -1, 0 and 1+ are handled properly by eventlet.sleep
self._logger.debug('Received RabbitMQ server error, sleeping for %s seconds '
@Kami (Member Author) commented on the diff:

Having this log message here would have saved me a lot of time trying to track down the root cause.
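For illustration, the added log makes the stall visible instead of leaving a silent hang. A hedged sketch of the surrounding retry path follows; the debug message is truncated in the diff above, so the continuation and the variable names here are assumptions:

```python
import eventlet


def _sleep_before_retry(logger, exc, sleep_delay):
    # Illustrative sketch only: the actual st2 message text is truncated
    # in the diff above, so the exact wording here is an assumption.
    logger.debug('Received RabbitMQ server error, sleeping for %s seconds '
                 'before retrying: %s', sleep_delay, str(exc))
    # -1, 0 and 1+ are handled properly by eventlet.sleep
    eventlet.sleep(sleep_delay)
```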

@Kami (Member Author) commented Jan 21, 2019

On a related note, I disabled the "bandit" check since it's too slow (it takes 10+ minutes) and provides little value.

@Kami (Member Author) commented Jan 21, 2019

To give you an idea of the build timing, here is the before and after:

Before:

[screenshot from 2019-01-21 13-52-39: build timing before the fix]

After:

[screenshot from 2019-01-21 13-54-33: build timing after the fix]

Subsequent runs should be even faster because the build will be able to utilize the dependency cache (the move from Precise to Xenial I was also working on nuked the dependency cache).

I will go ahead and merge this once the Travis build passes to unblock myself (tons of PRs and changes have been blocked because of these issues).

@arm4b (Member) left a comment:

Good to see the root cause resolved 👍

@Kami Kami merged commit e97a73a into master Jan 21, 2019
@Kami Kami deleted the fix_slow_travis_tests_issues_2 branch January 21, 2019 13:07
@Kami Kami modified the milestones: 3.0.0, 2.10.2 Feb 18, 2019
@Kami (Member Author) commented Mar 6, 2019

I also cherry-picked this into v2.9.3 to avoid the very slow test issues on Travis.

@Kami Kami mentioned this pull request Mar 6, 2019