Conversation

@Kami (Member) commented Jan 21, 2019

After spending a lot of time on Saturday and today, I believe I have finally tracked down the issue with the very slow tests on Travis.

It turns out that on Travis, RabbitMQ throws a non-fatal Channel.open: (504) CHANNEL_ERROR - second 'channel.open' seen error when using our connection retry wrapper code, which causes the code to "hang" and sleep for 10 seconds here: https://github.com/StackStorm/st2/blob/master/st2common/st2common/transport/connection_retry_wrapper.py#L130 (the issue can be reproduced 100% of the time on Travis).
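For reference, the retry wrapper follows a pattern roughly like the sketch below. This is an illustrative reconstruction, not the actual st2 code; the class name, method name, and the fixed 10-second delay are assumptions based on the linked file:

```python
import eventlet


class ConnectionRetryWrapper(object):
    """Minimal sketch of the retry-wrapper pattern described above
    (illustrative only, not the actual st2 implementation)."""

    def __init__(self, retries=5, logger=None):
        self._retries = retries
        self._logger = logger

    def run(self, func, *args, **kwargs):
        retries_left = self._retries
        while True:
            try:
                return func(*args, **kwargs)
            except Exception:
                retries_left -= 1
                if retries_left <= 0:
                    raise
                # A recoverable server error (such as the non-fatal
                # CHANNEL_ERROR seen on Travis) lands here. Without a
                # log line, the process silently stalls for the whole
                # delay before retrying.
                eventlet.sleep(10)
```

Each occurrence of the non-fatal CHANNEL_ERROR therefore costs a full sleep interval, which is exactly what inflates the Travis run times shown below.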

This happens very often, so tests are up to 20 times slower (and sometimes more) on Travis than locally.

For example:

Local VM:

(virtualenv) vagrant@ubuntu-xenial:/data/stanley$ nosetests --with-timer --rednose --immediate -sv st2actions/tests/unit/test_scheduler.py
test_create_from_liveaction (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_failed_next_item (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_garbage_collection (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_next_execution (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_next_executions_empty (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_no_double_entries (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_no_processing_of_non_requested_actions (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_processing_when_task_completed (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed

TEST RESULT OUTPUT:

[success] 19.82% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_processing_when_task_completed: 1.3436s
[success] 19.79% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_no_processing_of_non_requested_actions: 1.3413s
[success] 14.81% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_failed_next_item: 1.0039s
[success] 12.30% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_next_execution: 0.8336s
[success] 11.67% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_garbage_collection: 0.7914s
[success] 10.53% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_no_double_entries: 0.7141s
[success] 8.45% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_next_executions_empty: 0.5731s
[success] 2.63% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_create_from_liveaction: 0.1780s
-----------------------------------------------------------------------------
8 tests run in 8.990 seconds (8 tests passed)

Travis:

(virtualenv) travis@travis-job-4fb692bd-c140-4460-b3c1-76ca77e1819b:~/build/StackStorm/st2$ nosetests --with-timer --rednose --immediate -sv st2actions/tests/unit/test_scheduler.py
test_create_from_liveaction (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_failed_next_item (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_garbage_collection (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_next_execution (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_next_executions_empty (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_no_double_entries (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_no_processing_of_non_requested_actions (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed
test_processing_when_task_completed (tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest) ... passed

TEST RESULT OUTPUT:

[success] 64.62% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_no_processing_of_non_requested_actions: 103.8978s
[success] 8.10% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_next_execution: 13.0182s
[success] 8.08% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_failed_next_item: 12.9914s
[success] 8.00% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_garbage_collection: 12.8685s
[success] 7.98% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_processing_when_task_completed: 12.8290s
[success] 1.49% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_no_double_entries: 2.3939s
[success] 1.22% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_next_executions_empty: 1.9665s
[success] 0.51% tests.unit.test_scheduler.ActionExecutionSchedulingQueueItemDBTest.test_create_from_liveaction: 0.8142s
-----------------------------------------------------------------------------
8 tests run in 162.401 seconds (8 tests passed)

I haven't been able to track down the root cause yet, i.e. why RabbitMQ throws that error on Travis, but not locally or in other environments. The good news is that the workaround should be safe to use and appears to have no negative side effects. I tried things such as installing the latest and previous RabbitMQ server versions on Travis, but nothing made a difference. That could indicate the issue is related to the unique networking setup on Travis VMs or something similar.

It appears that this issue has been present on Travis for quite a long time, but it recently got so bad that we couldn't even get tests to pass due to the timing issues.

The workaround for the issue is available in 56b216b, da9177a.

I also have some other "related" PRs open, like switching from Precise to Xenial on Travis, etc., which I plan to finish now that this fix is in place.

@Kami Kami added this to the 3.0.0 milestone Jan 21, 2019
st2common/st2common/transport/connection_retry_wrapper.py:

raise

# -1, 0 and 1+ are handled properly by eventlet.sleep
self._logger.debug('Received RabbitMQ server error, sleeping for %s seconds '
@Kami (Member Author) commented on the diff:

Having this log message here would have saved me a lot of time trying to track down the root cause.
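For illustration, the added log makes the stall visible instead of leaving a silent hang. A hedged sketch of the surrounding retry path follows; the debug message is truncated in the diff above, so the continuation and the variable names here are assumptions:

```python
import eventlet


def _sleep_before_retry(logger, exc, sleep_delay):
    # Illustrative sketch only: the actual st2 message text is truncated
    # in the diff above, so the exact wording here is an assumption.
    logger.debug('Received RabbitMQ server error, sleeping for %s seconds '
                 'before retrying: %s', sleep_delay, str(exc))
    # -1, 0 and 1+ are handled properly by eventlet.sleep
    eventlet.sleep(sleep_delay)
```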

@Kami (Member Author) commented Jan 21, 2019

On a related note, I disabled the "bandit" check since it's too slow (it takes 10+ minutes) and provides little value.

@Kami (Member Author) commented Jan 21, 2019

To give you an idea of the build timing, here is the before and after:

Before:

[screenshot from 2019-01-21 13-52-39: build timing before the fix]

After:

[screenshot from 2019-01-21 13-54-33: build timing after the fix]

Subsequent runs should be even faster because the build will be able to utilize the dependency cache (the move from Precise to Xenial I was also working on nuked the dependency cache).

I will go ahead and merge this once the Travis build passes to unblock myself (tons of PRs and changes have been blocked because of these issues).

@arm4b (Member) left a comment:

Good to see the root cause resolved 👍

@Kami Kami merged commit e97a73a into master Jan 21, 2019
@Kami Kami deleted the fix_slow_travis_tests_issues_2 branch January 21, 2019 13:07
@Kami Kami modified the milestones: 3.0.0, 2.10.2 Feb 18, 2019
@Kami (Member Author) commented Mar 6, 2019

I also cherry-picked this into v2.9.3 to avoid the very slow test issues on Travis.

@Kami Kami mentioned this pull request Mar 6, 2019