@mike-tutkowski
Member

https://issues.apache.org/jira/browse/CLOUDSTACK-9374

From the ticket:

In the base.py file, there is a Host class with a delete instance method.

This method first attempts to transition the host into the maintenance resource state.

The first step in this process is to transition the host into the prepare-for-maintenance resource state.

A while later, the host can be transitioned completely into the maintenance resource state.

In an attempt to wait for this transition to occur, the delete method has a timer.sleep(30) call.

The hope is that the host will have transitioned from the prepare-for-maintenance resource state to the maintenance resource state within 30 seconds, but this does not always happen.

We should correct this problem by putting in logic to query the management server for the resource state of the host. If it's in the expected state, move on; else, sleep for a bit and try again (up to a certain limit).
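The query-then-retry approach described in the ticket can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: `wait_for_maintenance`, `query_state`, and the default interval and attempt limit are hypothetical names and values, and the state strings are assumed for the example. The management-server query is replaced by a stand-in that returns canned answers.

```python
import time


def wait_for_maintenance(query_state, interval=5, max_attempts=12, sleep=time.sleep):
    """Poll query_state() until it reports "Maintenance" or the attempt limit is hit.

    query_state is a hypothetical callable standing in for a management-server
    lookup of the host's resource state. Returns True if the state was reached.
    """
    for attempt in range(1, max_attempts + 1):
        if query_state() == "Maintenance":
            return True
        # Sleep between attempts, but not after the final one.
        if attempt < max_attempts:
            sleep(interval)
    return False


# Stand-in for the management server: the host needs two polls before it
# leaves the (assumed) "PrepareForMaintenance" state.
replies = iter(["PrepareForMaintenance", "PrepareForMaintenance", "Maintenance"])
reached = wait_for_maintenance(lambda: next(replies), interval=0)
print(reached)
```

Unlike a fixed 30-second sleep, the loop returns as soon as the state is reached and reports failure explicitly when the limit is exceeded.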

@mike-tutkowski
Member Author

I tested this by walking through with the debugger when the delete method on Host was invoked from a test script of mine.

@mike-tutkowski mike-tutkowski force-pushed the marvin_replace_sleep branch from 6945d37 to 5cf46a1 Compare May 3, 2016 18:06
@jburwell
Contributor

jburwell commented May 3, 2016

@mike-tutkowski I like this change, as it is a more accurate way of determining that the host is available than simply sleeping for an arbitrary period of time. You may want to consider refactoring to use the wait_until function in Marvin's utils.py. It provides a concise, Pythonic way of waiting for a condition. For an example of its usage, see test_host_maintenance.py, where it is used to wait until VM migration has completed before proceeding with the test.
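To illustrate the general shape of a callback-based wait helper, here is a self-contained sketch. It is a stand-in written for this example, not Marvin's wait_until: the name `wait_until_sketch`, its parameters, and the `(done, value)` callback convention are all assumptions, so check utils.py for the real signature before relying on it.

```python
import time


def wait_until_sketch(retry_interval, max_tries, callback, *callback_args):
    """Call callback(*callback_args) up to max_tries times.

    The callback is assumed to return a (done, value) pair. The helper stops
    as soon as done is true and returns the last pair it saw.
    """
    done, value = False, None
    for attempt in range(max_tries):
        done, value = callback(*callback_args)
        if done:
            break
        # Wait before the next try, but not after the last one.
        if attempt < max_tries - 1:
            time.sleep(retry_interval)
    return done, value


def host_in_maintenance(lookup, host_id):
    """Condition callback: lookup is a hypothetical state-query function."""
    state = lookup(host_id)
    return state == "Maintenance", state


# Canned responses standing in for the management server.
observed = iter(["PrepareForMaintenance", "Maintenance"])
done, state = wait_until_sketch(0, 5, host_in_maintenance, lambda _id: next(observed), "host-1")
print(done, state)
```

Keeping the retry loop in one helper means each test only has to supply the condition it is waiting for.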

@mike-tutkowski mike-tutkowski force-pushed the marvin_replace_sleep branch 4 times, most recently from b941af9 to c0b5298 Compare May 3, 2016 21:50
@mike-tutkowski
Member Author

@jburwell Thanks for pointing out that utility method. I have updated the code.


validationresult = validateList(hosts)

if validationresult[0] == FAIL:
Contributor

What if validationresult is None?

Member Author

I hear you, @jburwell, but the API description points out that the return type is a List with three items (and what each item indicates). We can certainly validate return types, but it adds overhead (both in terms of execution time and extra logic clouding things up) for limited value here (it will just throw an exception if the value happened to be None).

Thoughts on that?

Contributor

@jburwell jburwell May 4, 2016

@mike-tutkowski Yes, the exception will fail the test. However, the result is an unfriendly stack trace that does not explain the context and expectations of the failure. I suggest asserting these conditions with a message explaining the expectations (e.g. "Expected to find a host with an id of {}".format(hostid)).

I don't feel this issue is enough to hold back the PR. However, I think if we consistently asserted on our expectations/assumptions, test failures would be easier to comprehend and diagnose. Not only does the practice provide a human-readable explanation, but it also fails fast. All too often, test failures surface side effects of assumptions that failed multiple steps before the error occurred.

(Apologies for the mini PR rant)

@jburwell
Contributor

jburwell commented May 3, 2016

@mike-tutkowski the wait_until version looks very nice ;) A definite improvement in the reliability of the test case.

I added a couple of minor comments which, hopefully, will be straightforward to address.

@mike-tutkowski
Member Author

Thanks, @jburwell. While you're in a code-review mood, maybe you could take a look at #1403 again, as that one is pretty high value to a bunch of people. :)

@mike-tutkowski mike-tutkowski force-pushed the marvin_replace_sleep branch from c0b5298 to 0c3704f Compare May 3, 2016 23:44
@jburwell
Contributor

jburwell commented May 4, 2016

@mike-tutkowski I apologize for being behind on my review queue. I will move #1403 up on my list and get to it as quickly as I can.

@mike-tutkowski
Member Author

@jburwell How's about this?

    assert validationresult is not None and isinstance(validationresult, list) and len(validationresult) == 3, \
        "'validationresult' should be a list with three items in it."

@mike-tutkowski
Member Author

@jburwell I actually like this better because it tells me specifically what the particular failure is.

    assert validationresult is not None, "'validationresult' should not be equal to 'None'."

    assert isinstance(validationresult, list), "'validationresult' should be a 'list'."

    assert len(validationresult) == 3, "'validationresult' should be a list with three items in it."
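A small runnable sketch of why the three separate assertions are easier to diagnose: each malformed value trips a different message. The wrapper functions `check_validation_result` and `failure_message` are hypothetical names added for this illustration; only the three assert statements come from the comment above.

```python
def check_validation_result(validationresult):
    """Apply the three assertions in order so the first broken expectation is reported."""
    assert validationresult is not None, "'validationresult' should not be equal to 'None'."
    assert isinstance(validationresult, list), "'validationresult' should be a 'list'."
    assert len(validationresult) == 3, "'validationresult' should be a list with three items in it."


def failure_message(value):
    """Return the assertion message for value, or None if every check passes."""
    try:
        check_validation_result(value)
    except AssertionError as err:
        return str(err)
    return None


# Each malformed input produces its own, specific message.
for candidate in (None, ("PASS", None, None), ["PASS"], ["PASS", None, None]):
    print(repr(candidate), "->", failure_message(candidate))
```

Note that Python skips assert statements when run with the -O flag, so this style relies on the tests running without optimization.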

@mike-tutkowski mike-tutkowski force-pushed the marvin_replace_sleep branch from 0c3704f to 236b5bd Compare May 4, 2016 05:02
@jburwell
Contributor

jburwell commented May 4, 2016

@mike-tutkowski I agree with you about the use of assertions. A very nice improvement in the quality of the test.

LGTM based on code review

@mike-tutkowski
Member Author

Thanks @jburwell and @shwetaag for the reviews!

@swill We are good to go from a review standpoint here. I will update the commit SHA and re-push this because there appears to be weird stuff going on with Jenkins and Travis.

@mike-tutkowski mike-tutkowski force-pushed the marvin_replace_sleep branch 2 times, most recently from a771d4d to 5e98eb9 Compare May 5, 2016 13:20
@dmabry
Contributor

dmabry commented May 6, 2016

I know it doesn't really need my LGTM, but this commit definitely improves the accuracy, and possibly the performance, of Marvin, and I'd personally like to see this one merged. I have run a few Marvin tests that use this code, and it does work as designed.

@swill CI Test good, 2+ reviews. This is Ready to Merge.

@swill
Contributor

swill commented May 6, 2016

@dmabry Thank you. This type of feedback is very useful for me (as the RM).

@mike-tutkowski mike-tutkowski force-pushed the marvin_replace_sleep branch from 5e98eb9 to eeb3373 Compare May 7, 2016 01:21
@swill
Contributor

swill commented May 9, 2016

CI RESULTS

Tests Run: 85
  Skipped: 0
   Failed: 2
   Errors: 1
 Duration: 10h 39m 56s

Summary of the problem(s):

ERROR: Test to verify access to loadbalancer haproxy admin stats page
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_internal_lb.py", line 854, in tearDown
    raise Exception("Cleanup failed with %s" % e)
Exception: Cleanup failed with Job failed: {jobprocstatus : 0, created : u'2016-05-07T12:42:29+0200', jobresult : {errorcode : 530, errortext : u'Failed to delete network'}, cmd : u'org.apache.cloudstack.api.command.user.network.DeleteNetworkCmd', userid : u'31f149c9-1410-11e6-9280-5254001daa61', jobstatus : 2, jobid : u'0020223d-d484-443c-800e-9ee1e716ee7a', jobresultcode : 530, jobresulttype : u'object', jobinstancetype : u'Network', accountid : u'31f12b92-1410-11e6-9280-5254001daa61'}
----------------------------------------------------------------------
Additional details in: /tmp/MarvinLogs/test_network_44MNJ7/results.txt
FAIL: test_02_vpc_privategw_static_routes (integration.smoke.test_privategw_acl.TestPrivateGwACL)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 253, in test_02_vpc_privategw_static_routes
    self.performVPCTests(vpc_off)
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 324, in performVPCTests
    self.check_pvt_gw_connectivity(vm1, public_ip_1, vm2.nic[0].ipaddress)
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 559, in check_pvt_gw_connectivity
    "Ping to outside world from VM should be successful"
AssertionError: Ping to outside world from VM should be successful
----------------------------------------------------------------------
Additional details in: /tmp/MarvinLogs/test_network_44MNJ7/results.txt
FAIL: Test destroy(expunge) Virtual Machine
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_vm_life_cycle.py", line 646, in test_09_expunge_vm
    self.assertEqual(list_vm_response,None,"Check Expunged virtual machine is in listVirtualMachines response")
AssertionError: Check Expunged virtual machine is in listVirtualMachines response
----------------------------------------------------------------------
Additional details in: /tmp/MarvinLogs/test_vpc_routers_BJDKJO/results.txt

Associated Uploads

/tmp/MarvinLogs/DeployDataCenter__May_07_2016_07_00_21_EJXJA1:

/tmp/MarvinLogs/test_network_44MNJ7:

/tmp/MarvinLogs/test_vpc_routers_BJDKJO:

Uploads will be available until 2016-07-09 02:00:00 +0200 CEST

Comment created by upr comment.

@swill
Contributor

swill commented May 9, 2016

All tests are done and I have the code reviews I need. Adding to merge queue. Thx...

@asfgit asfgit merged commit eeb3373 into apache:master May 11, 2016
asfgit pushed a commit that referenced this pull request May 11, 2016
Marvin: Replace a timer.sleep(30) with pulling logic

https://issues.apache.org/jira/browse/CLOUDSTACK-9374

From the ticket:

In the base.py file, there is a Host class with a delete instance method.

This method first attempts to transition the host into the maintenance resource state.

The first step in this process is to transition the host into the prepare-for-maintenance resource state.

A while later, the host can be transitioned completely into the maintenance resource state.

In an attempt to wait for this transition to occur, the delete method has a timer.sleep(30) call.

The hope is that the host will have transitioned from the prepare-for-maintenance resource state to the maintenance resource state within 30 seconds, but this does not always happen.

We should correct this problem by putting in logic to query the management server for the resource state of the host. If it's in the expected state, move on; else, sleep for a bit and try again (up to a certain limit).

* pr/1529:
  Replace a timer.sleep(30) with pulling logic

Signed-off-by: Will Stevens <williamstevens@gmail.com>
@mike-tutkowski mike-tutkowski deleted the marvin_replace_sleep branch May 11, 2016 13:49