
Bug 1731263: Reenable basic pod preemption e2es #23758

Closed

damemi wants to merge 1 commit into openshift:master from damemi:preemption-no-flaky

Conversation

@damemi
Contributor

@damemi damemi commented Sep 10, 2019

This test was marked as flaky, but it has since had a number of fixes applied to it (PRs #23645 and #23719) that adjusted the node utilization calculations and the structure of the test to hopefully make it more stable. To address the comments about this test's flakiness in the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1731263#c9), we should remove the [flaky] tag to re-enable the tests.
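
For context, the utilization calculation those fixes touch boils down to summing the resource requests of the pods already bound to each node. Here is a minimal, self-contained sketch of that bookkeeping; the helper names are illustrative, not the exact e2e code:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podRequests sums the CPU (milli) and memory (bytes) requests of one pod.
func podRequests(pod *v1.Pod) (cpuMilli, memBytes int64) {
	for _, c := range pod.Spec.Containers {
		if cpu, ok := c.Resources.Requests[v1.ResourceCPU]; ok {
			cpuMilli += cpu.MilliValue()
		}
		if mem, ok := c.Resources.Requests[v1.ResourceMemory]; ok {
			memBytes += mem.Value()
		}
	}
	return cpuMilli, memBytes
}

// nodeUsage sums requests across the pods currently bound to the given node,
// i.e. the "current cpu and memory usage" the test logs per node.
func nodeUsage(pods []*v1.Pod, nodeName string) (cpuMilli, memBytes int64) {
	for _, p := range pods {
		if p.Spec.NodeName != nodeName {
			continue
		}
		c, m := podRequests(p)
		cpuMilli += c
		memBytes += m
	}
	return cpuMilli, memBytes
}

func main() {
	// One example pod bound to "node-1" requesting 500m CPU and 1Gi memory.
	pod := &v1.Pod{Spec: v1.PodSpec{
		NodeName: "node-1",
		Containers: []v1.Container{{
			Name: "app",
			Resources: v1.ResourceRequirements{Requests: v1.ResourceList{
				v1.ResourceCPU:    resource.MustParse("500m"),
				v1.ResourceMemory: resource.MustParse("1Gi"),
			}},
		}},
	}}
	cpu, mem := nodeUsage([]*v1.Pod{pod}, "node-1")
	fmt.Printf("Current cpu and memory usage %d, %d\n", cpu, mem)
}
```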

@openshift-ci-robot

@damemi: This pull request references Bugzilla bug 1731263, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

Bug 1731263: Removes [flaky] from basic pod preemption e2e to re-enable it

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 10, 2019
@damemi
Contributor Author

damemi commented Sep 10, 2019

@deads2k you mentioned that the [Flaky] tag is how these tests get disabled; could you confirm on this PR that removing it will re-enable them?

@damemi damemi force-pushed the preemption-no-flaky branch from 183cd6d to abb0c6b on September 10, 2019 15:29
@damemi
Contributor Author

damemi commented Sep 10, 2019

/retest
Is this still flaking?

@damemi
Contributor Author

damemi commented Sep 10, 2019

/hold
I think these tests just need to be rewritten; there are some spots that don't make sense to me.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2019
@soltysh
Contributor

soltysh commented Sep 11, 2019

I think these tests need to just be rewritten, there are some spots that don't make sense to me

@damemi how about checking what the newer version of this test will look like? I mean the one from k8s 1.16 that the rebase introduces, and only looking at this again after the rebase lands. I'm moving the BZ to 4.3.

@damemi
Contributor Author

damemi commented Sep 11, 2019

@soltysh This test is actually ahead of 1.16 right now (origin already has the changes from kubernetes/kubernetes#82350, which hasn't merged yet).

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 11, 2019
@damemi damemi force-pushed the preemption-no-flaky branch from df5de77 to 683c7b5 on September 11, 2019 20:31
@damemi
Contributor Author

damemi commented Sep 16, 2019

/retest

@soltysh
Contributor

soltysh commented Oct 14, 2019

@damemi what's the call here?

@damemi
Contributor Author

damemi commented Oct 14, 2019

@soltysh I need to revisit this test entirely; it's not flaking at all upstream, but when I submitted this PR to remove [flaky] it still failed CI.

@soltysh
Contributor

soltysh commented Nov 5, 2019

@damemi what's the current state of this?

@damemi damemi closed this Nov 7, 2019
@damemi damemi force-pushed the preemption-no-flaky branch from 683c7b5 to 78cebd2 on November 7, 2019 18:39
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 7, 2019
@damemi
Contributor Author

damemi commented Nov 7, 2019

It seems that this test was removed from [flaky] status with the k8s bump in #23811, so it should be re-enabled now, I think...

Though I can't find the test in the 4.2 or 4.3 testgrid,

but I do see preemption tests in, for example, 4.1 blocking: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.1-blocking#release-openshift-origin-installer-e2e-aws-serial-4.1. Maybe I'm just not finding it, so I'll switch the BZ back to MODIFIED and have QA take a look as well.

@damemi
Contributor Author

damemi commented Nov 7, 2019

/reopen

@openshift-ci-robot

@damemi: Reopened this PR.


In response to this:

/reopen


@openshift-ci-robot openshift-ci-robot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Nov 7, 2019
@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Nov 7, 2019
@openshift-ci-robot

@damemi: This pull request references Bugzilla bug 1731263, which is valid.


In response to this:

/bugzilla refresh


@damemi
Contributor Author

damemi commented Nov 7, 2019

/retest

@damemi damemi changed the title from "Bug 1731263: Removes [flaky] from basic pod preemption e2e to re-enable it" to "Bug 1731263: Reenable basic pod preemption e2es" Nov 7, 2019
@damemi
Contributor Author

damemi commented Nov 8, 2019

/retest

@damemi
Contributor Author

damemi commented Nov 8, 2019

So I noticed that, with our changes to this test, the preemptor pod wasn't always being scheduled onto the node we expected. For example, if there is a low priority pod, we were assuming that the preemptor pod would be placed on the same node as that low priority pod, which wasn't always the case, so the test failed with no pods preempted and the preemptor pod running.

Now I've made an update that ensures the preemptor pod will land on the appropriate node (sketched below). However, it seems that preemption just isn't running: the preemptor pod is now attempting to schedule on the correct node, but it's failing due to lack of CPU:

event for preemptor-pod: {kubelet ip-10-0-141-255.ec2.internal} OutOfcpu: Node didn't have enough resource: cpu, requested: 1686, used: 2176, capacity: 3500
...
Nov  8 18:37:20.016: INFO: pod0-sched-preemption-low-priority     ip-10-0-141-255.ec2.internal  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  }]
Nov  8 18:37:20.016: INFO: pod1-sched-preemption-medium-priority  ip-10-0-142-166.ec2.internal  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  }]
Nov  8 18:37:20.016: INFO: pod2-sched-preemption-medium-priority  ip-10-0-154-84.ec2.internal   Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:18 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-11-08 18:37:09 +0000 UTC  }]
Nov  8 18:37:20.016: INFO: preemptor-pod                          ip-10-0-141-255.ec2.internal  Failed          []

@ravisantoshgudimetla, do we need to enable pod preemption in our e2es somehow? Failing to schedule due to lack of CPU is exactly what we want to test.
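
For reference, the node-pinning change I mentioned above has roughly this shape; a minimal sketch using the core/v1 node affinity API, not the exact diff (the priority class name and image here are placeholders):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pinToNode returns a required node affinity that only matches nodeName,
// so the scheduler can only place the pod there (or leave it pending).
func pinToNode(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

// preemptorPod builds the high-priority pod pinned to the target node.
// "high-priority" is a placeholder PriorityClass name.
func preemptorPod(nodeName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "preemptor-pod"},
		Spec: v1.PodSpec{
			PriorityClassName: "high-priority",
			Affinity:          pinToNode(nodeName),
			Containers: []v1.Container{{
				Name:  "pause",
				Image: "k8s.gcr.io/pause:3.1", // illustrative image
			}},
		},
	}
}

func main() {
	p := preemptorPod("ip-10-0-141-255.ec2.internal")
	fmt.Println(p.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms[0].MatchFields[0].Values)
}
```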

Update: actually, this is interesting. Looking at the usage numbers as the victim pods are scheduled:

Nov  8 18:37:09.815: INFO: Current cpu and memory usage 690, 1572864000
Nov  8 18:37:09.842: INFO: Created pod: pod0-sched-preemption-low-priority
Nov  8 18:37:09.842: INFO: Current cpu and memory usage 1950, 4690280448
Nov  8 18:37:09.864: INFO: Created pod: pod1-sched-preemption-medium-priority
Nov  8 18:37:09.864: INFO: Current cpu and memory usage 2380, 4982833152
Nov  8 18:37:09.889: INFO: Created pod: pod2-sched-preemption-medium-priority

it appears that the low priority pod is requesting 1950 - 690 = 1260 CPU, but in the log I posted above the preemptor is requesting 1686, which is closer to the CPU usage after scheduling pod1-medium-priority (2380 - 690 = 1690). That may just be a coincidence, so I'm going to increase logging in this test to make sure we are requesting the appropriate amount for our preemptor.
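
Spelling out that arithmetic with the numbers copied from the log above (just the bookkeeping, not test code):

```go
package main

import "fmt"

func main() {
	// Cumulative node CPU usage (milli) logged just before each victim pod was created.
	usageBefore := []int64{690, 1950, 2380}
	names := []string{"pod0-sched-preemption-low-priority", "pod1-sched-preemption-medium-priority"}
	for i, name := range names {
		fmt.Printf("%s appears to request %d milli CPU\n", name, usageBefore[i+1]-usageBefore[i])
	}
	// Prints 1260 for pod0 and 430 for pod1; the preemptor's logged request of 1686
	// is much closer to 1260+430 = 2380-690 = 1690 than to pod0's 1260 alone.
}
```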

However, I am still curious about my first question: do we need to do anything to enable preemption here?

@damemi
Contributor Author

damemi commented Nov 12, 2019

/retest

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 13, 2019
@damemi
Contributor Author

damemi commented Nov 13, 2019

/retest

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 13, 2020
@damemi damemi force-pushed the preemption-no-flaky branch from c45b9d9 to f2cceac on March 3, 2020 18:08
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 3, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: damemi
To complete the pull request process, please assign smarterclayton
You can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damemi
Contributor Author

damemi commented Mar 3, 2020

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2020
@damemi damemi force-pushed the preemption-no-flaky branch from f2cceac to 222af67 on March 3, 2020 19:23
@ingvagabund
Member

/retest

@damemi
Contributor Author

damemi commented Mar 4, 2020

/retest

@openshift-ci-robot

openshift-ci-robot commented Mar 4, 2020

@damemi: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-upgrade 683c7b5 link /test e2e-aws-upgrade
ci/prow/e2e-aws-fips 222af67 link /test e2e-aws-fips
ci/prow/e2e-aws-serial 222af67 link /test e2e-aws-serial
ci/prow/e2e-gcp 222af67 link /test e2e-gcp

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@soltysh
Contributor

soltysh commented Apr 21, 2020

Irrelevant.
/close

@openshift-ci-robot

@soltysh: Closed this PR.


In response to this:

Irrelevant.
/close


