
test/e2e: scheduling: disable preemption tests #23029

Merged
openshift-merge-robot merged 1 commit into openshift:master from sjenning:disable-sched-preemption-serial-test
Jun 5, 2019

Conversation

@sjenning (Contributor) commented Jun 4, 2019

After Prometheus started making reasonable memory requests, the assumption that the SchedulerPreemption tests make about the scheduled load on test nodes (i.e. that less than 40% of capacity is scheduled) no longer holds.

Example e2e failure:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/598

Jun  4 15:35:22.846: INFO: At 2019-06-04 15:35:10 +0000 UTC - event for pod0-sched-preemption-low-priority: {default-scheduler } Scheduled: Successfully assigned sched-preemption-2604/pod0-sched-preemption-low-priority to ip-10-0-138-81.ec2.internal
Jun  4 15:35:22.846: INFO: At 2019-06-04 15:35:10 +0000 UTC - event for pod0-sched-preemption-low-priority: {default-scheduler } Preempted: by sched-preemption-2604/pod1-sched-preemption-medium-priority on node ip-10-0-138-81.ec2.internal
Jun  4 15:35:22.846: INFO: At 2019-06-04 15:35:10 +0000 UTC - event for pod1-sched-preemption-medium-priority: {default-scheduler } FailedScheduling: 0/6 nodes are available: 1 Insufficient memory, 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.
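
A quick way to confirm the scheduled load on the node named in the failure above (a sketch, not part of this PR; it assumes a cluster-admin kubeconfig against the affected cluster) is the "Allocated resources" section of kubectl describe node, which sums the requests of every pod scheduled there:

  $ # requests/limits summary for the node from the failure log;
  $ # the cpu/memory "Requests" percentages are what the SchedulerPreemption
  $ # tests assume stay below 40% of allocatable capacity
  $ kubectl describe node ip-10-0-138-81.ec2.internal | grep -A 8 'Allocated resources'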

Flaking since openshift/prometheus-operator#30, which allowed the resource requests for the Prometheus statefulset to flow through.
https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-serial-4.2&sort-by-flakiness=

BZ to track re-enablement:
https://bugzilla.redhat.com/show_bug.cgi?id=1717198

@smarterclayton @wking @ravisantoshgudimetla

openshift-ci-robot added the size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files) and approved (Indicates a PR has been approved by an approver from all required OWNERS files) labels on Jun 4, 2019
@wking (Member) commented Jun 4, 2019

/lgtm

openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label on Jun 4, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sjenning, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking (Member) commented Jun 5, 2019

/retest

Now that openshift/cluster-kube-apiserver-operator#495 has landed.

openshift-merge-robot merged commit 4d45bb3 into openshift:master on Jun 5, 2019
wking added a commit to wking/openshift-release that referenced this pull request Jun 11, 2019
Prometheus started making memory requests with
openshift/prometheus-operator@cda68a3f (Merge pull request
openshift/prometheus-operator#30 from paulfantom/merge-release-0.30.1,
2019-06-04):

  $ diff -u <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/591/artifacts/e2e-aws-serial/pods.json | jq '.items[] | select(.metadata.name | contains("prometheus")) | {name: .metadata.name, resources: [.spec.containers[].resources | select((. | length) > 0)]}') <(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/592/artifacts/e2e-aws-serial/pods.json | jq '.items[] | select(.metadata.name | contains("prometheus")) | {name: .metadata.name, resources: [.spec.containers[].resources | select((. | length) > 0)]}')
  --- /dev/fd/63    2019-06-04 14:10:31.908436038 -0700
  +++ /dev/fd/62    2019-06-04 14:10:31.908436038 -0700
  @@ -1,5 +1,5 @@
  {
  -  "name": "prometheus-adapter-5f78cc955d-2899k",
  +  "name": "prometheus-adapter-64f4f64b7-pvmhn",
    "resources": [
      {
        "requests": {
  @@ -10,7 +10,7 @@
    ]
  }
  {
  -  "name": "prometheus-adapter-5f78cc955d-2rlnx",
  +  "name": "prometheus-adapter-64f4f64b7-tgnld",
    "resources": [
      {
        "requests": {
  @@ -22,14 +22,56 @@
  }
  {
    "name": "prometheus-k8s-0",
  -  "resources": []
  +  "resources": [
  +    {
  +      "limits": {
  +        "cpu": "100m",
  +        "memory": "25Mi"
  +      },
  +      "requests": {
  +        "cpu": "100m",
  +        "memory": "25Mi"
  +      }
  +    },
  +    {
  +      "limits": {
  +        "cpu": "100m",
  +        "memory": "25Mi"
  +      },
  +      "requests": {
  +        "cpu": "100m",
  +        "memory": "25Mi"
  +      }
  +    }
  +  ]
  }
  {
    "name": "prometheus-k8s-1",
  -  "resources": []
  +  "resources": [
  +    {
  +      "limits": {
  +        "cpu": "100m",
  +        "memory": "25Mi"
  +      },
  +      "requests": {
  +        "cpu": "100m",
  +        "memory": "25Mi"
  +      }
  +    },
  +    {
  +      "limits": {
  +        "cpu": "100m",
  +        "memory": "25Mi"
  +      },
  +      "requests": {
  +        "cpu": "100m",
  +        "memory": "25Mi"
  +      }
  +    }
  +  ]
  }
  {
  -  "name": "prometheus-operator-68f7b6bd55-hmqtj",
  +  "name": "prometheus-operator-d8745bf44-l9khn",
    "resources": [
      {
        "requests": {

With that change, our nodes no longer satisfied the assumption that
the SchedulerPreemption tests make about the scheduled load on test
nodes (i.e. that less than 40% of capacity is scheduled).
openshift/origin@13b6d0e4a7 (test/e2e: scheduling: disable preemption
tests, 2019-06-04, openshift/origin#23029) disabled the test, but this
change takes the alternative temporary workaround of bumping our
node capacity to re-satisfy the existing test's assumptions.

We have sufficient capacity for doubling our xlarge consumption:

  $ export AWS_PROFILE=ci
  $ aws --region us-east-1 support describe-trusted-advisor-checks --language en --query "checks[? category == 'service_limits'].{id: @.id, name: @.name}" --output text | grep 'EC2 On-Demand Instances'
  0Xc6LMYG8P   EC2 On-Demand Instances
  $ AWS_PROFILE=ci aws --region us-east-1 support describe-trusted-advisor-check-result --check-id 0Xc6LMYG8P --query "join(\`\\n\`, result.flaggedResources[].join(\`\\t\`, [@.metadata[4] || '0', @.metadata[3], @.region || '-', '0Xc6LMYG8P', @.metadata[2]]))" --output text
  91  3000  us-east-1  0Xc6LMYG8P  On-Demand instances - m4.large
  97  3000  us-east-1  0Xc6LMYG8P  On-Demand instances - m4.xlarge
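
To spot-check that the bumped node capacity actually re-satisfies the test's under-40% assumption, something along these lines should do (a sketch, assuming a kubeconfig for a cluster built on the new instance types; the beta.kubernetes.io/instance-type label is the one nodes carried at this Kubernetes level):

  $ # instance type and allocatable cpu/memory per node
  $ kubectl get nodes -o json | jq -r '.items[] | [.metadata.name, .metadata.labels["beta.kubernetes.io/instance-type"], .status.allocatable.cpu, .status.allocatable.memory] | @tsv'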
