
Bug 1731263: Reenable basic pod preemption e2es #23758

Closed

damemi wants to merge 1 commit into openshift:master from damemi:preemption-no-flaky

Conversation

@damemi
Contributor

@damemi damemi commented Sep 10, 2019

This test was marked as flaky, but it has since had a number of fixes applied to it (PRs #23645 and #23719) that adjusted the node utilization calculations and the structure of the test to hopefully make it more stable. To address the comments about this test's flakiness in the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1731263#c9), we should remove the [flaky] tag to re-enable the tests.
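
For context, the utilization calculation those fixes touch boils down to summing the resource requests of the pods already bound to each node. Here is a minimal, self-contained sketch of that bookkeeping; the helper names are illustrative, not the exact e2e code:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podRequests sums the CPU (milli) and memory (bytes) requests of one pod.
func podRequests(pod *v1.Pod) (cpuMilli, memBytes int64) {
	for _, c := range pod.Spec.Containers {
		if cpu, ok := c.Resources.Requests[v1.ResourceCPU]; ok {
			cpuMilli += cpu.MilliValue()
		}
		if mem, ok := c.Resources.Requests[v1.ResourceMemory]; ok {
			memBytes += mem.Value()
		}
	}
	return cpuMilli, memBytes
}

// nodeUsage sums requests across the pods currently bound to the given node,
// i.e. the "current cpu and memory usage" the test logs per node.
func nodeUsage(pods []*v1.Pod, nodeName string) (cpuMilli, memBytes int64) {
	for _, p := range pods {
		if p.Spec.NodeName != nodeName {
			continue
		}
		c, m := podRequests(p)
		cpuMilli += c
		memBytes += m
	}
	return cpuMilli, memBytes
}

func main() {
	// One example pod bound to "node-1" requesting 500m CPU and 1Gi memory.
	pod := &v1.Pod{Spec: v1.PodSpec{
		NodeName: "node-1",
		Containers: []v1.Container{{
			Name: "app",
			Resources: v1.ResourceRequirements{Requests: v1.ResourceList{
				v1.ResourceCPU:    resource.MustParse("500m"),
				v1.ResourceMemory: resource.MustParse("1Gi"),
			}},
		}},
	}}
	cpu, mem := nodeUsage([]*v1.Pod{pod}, "node-1")
	fmt.Printf("Current cpu and memory usage %d, %d\n", cpu, mem)
}
```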

@openshift-ci-robot

@damemi: This pull request references Bugzilla bug 1731263, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

Bug 1731263: Removes [flaky] from basic pod preemption e2e to re-enable it

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 10, 2019
@damemi
Contributor Author

damemi commented Sep 10, 2019

@deads2k you mentioned that the [Flaky] tag is how these tests get disabled; could you confirm on this PR that removing it will re-enable them?

@damemi damemi force-pushed the preemption-no-flaky branch from 183cd6d to abb0c6b on September 10, 2019 15:29
@damemi
Contributor Author

damemi commented Sep 10, 2019

/retest
Is this still flaking?

@damemi
Contributor Author

damemi commented Sep 10, 2019

/hold
I think these tests just need to be rewritten; there are some spots that don't make sense to me.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2019
@soltysh
Contributor

soltysh commented Sep 11, 2019

I think these tests need to just be rewritten, there are some spots that don't make sense to me

@damemi how about checking what the newer version of this test will look like? I mean the one from k8s 1.16 that the rebase introduces, and only looking at this again after the rebase lands. I'm moving the BZ to 4.3.

@damemi
Contributor Author

damemi commented Sep 11, 2019

@soltysh This test is actually ahead of 1.16 right now (origin already has the changes from kubernetes/kubernetes#82350, which hasn't merged yet).

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 11, 2019
@damemi damemi force-pushed the preemption-no-flaky branch from df5de77 to 683c7b5 on September 11, 2019 20:31
@damemi
Contributor Author

damemi commented Sep 16, 2019

/retest

@soltysh
Contributor

soltysh commented Oct 14, 2019

@damemi what's the call here?

@damemi
Contributor Author

damemi commented Oct 14, 2019

@soltysh I need to revisit this test entirely; it's not flaking at all upstream, but when I submitted this PR to remove [flaky] it still failed CI.

@soltysh
Contributor

soltysh commented Nov 5, 2019

@damemi what's the current state of this?

@damemi damemi closed this Nov 7, 2019
@damemi damemi force-pushed the preemption-no-flaky branch from 683c7b5 to 78cebd2 on November 7, 2019 18:39
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 7, 2019
@damemi
Contributor Author

damemi commented Nov 7, 2019

It seems that this test was removed from [flaky] status with the k8s bump in #23811, so it should be re-enabled now, I think...

Though I can't find the test in the 4.2 or 4.3 testgrid,

but I do see preemption tests in, for example, 4.1 blocking: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.1-blocking#release-openshift-origin-installer-e2e-aws-serial-4.1. Maybe I'm just not finding it, so I'll switch the BZ back to MODIFIED and have QA take a look as well.

@damemi
Contributor Author

damemi commented Nov 7, 2019

/reopen

@openshift-ci-robot

@damemi: Reopened this PR.


In response to this:

/reopen


@openshift-ci-robot openshift-ci-robot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Nov 7, 2019
@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Nov 7, 2019
@openshift-ci-robot

@damemi: This pull request references Bugzilla bug 1731263, which is valid.


In response to this:

/bugzilla refresh


@damemi
Contributor Author

damemi commented Nov 7, 2019

/retest

@damemi damemi changed the title from "Bug 1731263: Removes [flaky] from basic pod preemption e2e to re-enable it" to "Bug 1731263: Reenable basic pod preemption e2es" Nov 7, 2019
@damemi
Contributor Author

damemi commented Nov 8, 2019

/retest

@damemi
Contributor Author

damemi commented Nov 8, 2019

So I noticed that, with our changes to this test, the preemptor pod wasn't always being scheduled onto the node we expected. For example, if there is a low priority pod, we were assuming that the preemptor pod would be placed on the same node as that low priority pod, which wasn't always the case, so the test failed with no pods preempted and the preemptor pod running.

Now I've made an update that ensures the preemptor pod will land on the appropriate node (sketched below). However, it seems that preemption just isn't running: the preemptor pod is now attempting to schedule on the correct node, but it's failing due to lack of CPU:

event for preemptor-pod: {kubelet ip-10-0-141-255.ec2.internal} OutOfcpu: Node didn't have enough resource: cpu, requested: 1686, used: 2176, capacity: 3500
...
Nov  8 18:37:20.016: INFO: pod0-sched-preemption-low-priority     ip-10-0-141-255.ec2.internal  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  }]
Nov  8 18:37:20.016: INFO: pod1-sched-preemption-medium-priority  ip-10-0-142-166.ec2.internal  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  }]
Nov  8 18:37:20.016: INFO: pod2-sched-preemption-medium-priority  ip-10-0-154-84.ec2.internal   Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  }]
Nov  8 18:37:20.016: INFO: preemptor-pod                          ip-10-0-141-255.ec2.internal  Failed          []

@ravisantoshgudimetla, do we need to enable pod preemption in our e2es somehow? Failing to schedule due to lack of CPU is exactly what we want to test.
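
For reference, the node-pinning change I mentioned above has roughly this shape; a minimal sketch using the core/v1 node affinity API, not the exact diff (the priority class name and image here are placeholders):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pinToNode returns a required node affinity that only matches nodeName,
// so the scheduler can only place the pod there (or leave it pending).
func pinToNode(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

// preemptorPod builds the high-priority pod pinned to the target node.
// "high-priority" is a placeholder PriorityClass name.
func preemptorPod(nodeName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "preemptor-pod"},
		Spec: v1.PodSpec{
			PriorityClassName: "high-priority",
			Affinity:          pinToNode(nodeName),
			Containers: []v1.Container{{
				Name:  "pause",
				Image: "k8s.gcr.io/pause:3.1", // illustrative image
			}},
		},
	}
}

func main() {
	p := preemptorPod("ip-10-0-141-255.ec2.internal")
	fmt.Println(p.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0].MatchFields[0].Values)
}
```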

Update: actually, this is interesting. Looking at the usage numbers as the victim pods are scheduled:

Nov  8 18:37:09.815: INFO: Current cpu and memory usage 690, 1572864000
Nov  8 18:37:09.842: INFO: Created pod: pod0-sched-preemption-low-priority
Nov  8 18:37:09.842: INFO: Current cpu and memory usage 1950, 4690280448
Nov  8 18:37:09.864: INFO: Created pod: pod1-sched-preemption-medium-priority
Nov  8 18:37:09.864: INFO: Current cpu and memory usage 2380, 4982833152
Nov  8 18:37:09.889: INFO: Created pod: pod2-sched-preemption-medium-priority

it appears that the low priority pod is requesting 1950 - 690 = 1260 CPU, but in the log I posted above the preemptor is requesting 1686, which is closer to the CPU usage after scheduling pod1-medium-priority (2380 - 690 = 1690). That may just be a coincidence, so I'm going to increase logging in this test to make sure we are requesting the appropriate amount for our preemptor.
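
Spelling out that arithmetic with the numbers copied from the log above (just the bookkeeping, not test code):

```go
package main

import "fmt"

func main() {
	// Cumulative node CPU usage (milli) logged just before each victim pod was created.
	usageBefore := []int64{690, 1950, 2380}
	names := []string{"pod0-sched-preemption-low-priority", "pod1-sched-preemption-medium-priority"}
	for i, name := range names {
		fmt.Printf("%s appears to request %d milli CPU\n", name, usageBefore[i+1]-usageBefore[i])
	}
	// Prints 1260 for pod0 and 430 for pod1; the preemptor's logged request of 1686
	// is much closer to 1260+430 = 2380-690 = 1690 than to pod0's 1260 alone.
}
```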

However, I am still curious about my first question: do we need to do anything to enable preemption here?

@damemi
Contributor Author

damemi commented Nov 12, 2019

/retest

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 13, 2019
@damemi
Contributor Author

damemi commented Nov 13, 2019

/retest

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 13, 2020
@damemi damemi force-pushed the preemption-no-flaky branch from c45b9d9 to f2cceac on March 3, 2020 18:08
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 3, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: damemi
To complete the pull request process, please assign smarterclayton
You can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damemi
Contributor Author

damemi commented Mar 3, 2020

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2020
@damemi damemi force-pushed the preemption-no-flaky branch from f2cceac to 222af67 on March 3, 2020 19:23
@ingvagabund
Member

/retest

@damemi
Contributor Author

damemi commented Mar 4, 2020

/retest

@openshift-ci-robot

openshift-ci-robot commented Mar 4, 2020

@damemi: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-upgrade 683c7b5 link /test e2e-aws-upgrade
ci/prow/e2e-aws-fips 222af67 link /test e2e-aws-fips
ci/prow/e2e-aws-serial 222af67 link /test e2e-aws-serial
ci/prow/e2e-gcp 222af67 link /test e2e-gcp

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@soltysh
Contributor

soltysh commented Apr 21, 2020

Irrelevant.
/close

@openshift-ci-robot

@soltysh: Closed this PR.


In response to this:

Irrelevant.
/close


