
Run the chaosduck component during our e2e testing. #3565

Merged
knative-prow-robot merged 1 commit into knative:master from mattmoor:chaosduck
Jul 15, 2020

Conversation

@mattmoor
Member

No description provided.

@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 11, 2020
@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jul 11, 2020
@knative-prow-robot knative-prow-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. area/test-and-release Test infrastructure, tests or release labels Jul 11, 2020
@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2020
@mattmoor
Member Author

It passed, running it again.

/test pull-knative-eventing-integration-tests

@mattmoor
Member Author

Pretty sure this isn't running properly yet. Without a replicated webhook, I'd at least expect intermittent failures there, and looking through the logs the pods are far too old relative to the chaos duck:

chaosduck-5989f5cd88-mrb4n             1/1   Running   0     103s
eventing-controller-76cbd5d948-gcvbk   1/1   Running   0     2m38s
eventing-webhook-b66887bcd-ctnk6       1/1   Running   1     2m32s
eventing-webhook-b66887bcd-hr8x2       1/1   Running   0     2m32s
imc-controller-6c7979cfd9-ft7kz        1/1   Running   0     4s
imc-dispatcher-6596db8fcc-wk82v        1/1   Running   0     4s

@mattmoor
Member Author

Cracking the logs I see:

knative-eventing-9utkcux231/chaosduck-5989f5cd88-mrb4n[chaosduck]: 2020/07/12 15:00:22 Ended iteration with err: pods "eventing-controller-76cbd5d948-gcvbk" is forbidden: User "system:serviceaccount:knative-eventing-9utkcux231:eventing-controller" cannot delete resource "pods" in API group "" in the namespace "knative-eventing-9utkcux231"

I just need to be less lazy about piggybacking on an existing SA and create my own with the right RBAC 🙃
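
The "forbidden" error above comes down to an RBAC rule check: the borrowed service account has no rule granting `delete` on `pods`. A minimal sketch of that check follows, assuming a simplified rule model that ignores resource names and subresources. `PolicyRule`, `allows`, and both rule sets are hypothetical illustrations; the actual roles in this repository are not shown in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    """Simplified stand-in for a Kubernetes RBAC policy rule (hypothetical)."""
    api_groups: tuple
    resources: tuple
    verbs: tuple


def _matches(allowed, value):
    # "*" is treated as a wildcard that matches any value.
    return "*" in allowed or value in allowed


def allows(rules, verb, resource, api_group=""):
    """Return True if any rule grants `verb` on `resource` in `api_group`."""
    return any(
        _matches(rule.api_groups, api_group)
        and _matches(rule.resources, resource)
        and _matches(rule.verbs, verb)
        for rule in rules
    )


# Assumed rule sets, for illustration only: a read-only set for the borrowed
# service account and a dedicated set that also grants "delete".
borrowed_rules = [PolicyRule(("",), ("pods",), ("get", "list", "watch"))]
dedicated_rules = [PolicyRule(("",), ("pods",), ("get", "list", "delete"))]

print(allows(borrowed_rules, "delete", "pods"))
print(allows(dedicated_rules, "delete", "pods"))
```

The core API group is the empty string, which matches the `API group ""` wording in the log line.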

@mattmoor
Member Author

This should be fixed, let's see what breaks now 😈

@mattmoor
Member Author

The problem now is where I'm standing things up. I saw this in serving where the wait_for_ready_pods in the serving namespace never completes because there is always a pod in terminating 😈
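
To illustrate why a readiness wait can hang under constant pod churn, here is a rough sketch of a check that can either count or skip terminating pods. It is not how `wait_for_ready_pods` is implemented; the `Pod` type, its fields, and `all_pods_ready` are hypothetical, and treating a terminating pod as "not ready" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    """Hypothetical, minimal view of a pod's status."""
    name: str
    phase: str
    ready: bool
    deletion_timestamp: Optional[str] = None  # set once the pod is terminating


def all_pods_ready(pods, ignore_terminating):
    """Return True when every pod that is considered is Running and ready."""
    for pod in pods:
        terminating = pod.deletion_timestamp is not None
        if terminating and ignore_terminating:
            continue  # skip pods that are already on their way out
        if terminating or pod.phase != "Running" or not pod.ready:
            return False
    return True


pods = [
    Pod("eventing-webhook-a", "Running", True),
    Pod("eventing-webhook-b", "Running", True, deletion_timestamp="2020-07-12T15:00:22Z"),
]
print(all_pods_ready(pods, ignore_terminating=False))
print(all_pods_ready(pods, ignore_terminating=True))
```

With pods being deleted continuously, the strict variant may never observe a moment without a terminating pod, which matches the behavior described above.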

@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 14, 2020
@mattmoor
Member Author

With the mt-broker-controller enabled, some Brokers start failing to become ready 🤔

Rolled back the broker bit and trying the next component on my list.

This runs the following components in an HA configuration and enables "chaosduck" on them:
 - eventing webhook
 - eventing controller
 - sugar controller

This also stubs things out for the IMC controller/dispatcher and the MT Broker, but these are disabled due to observed issues (see linked issues).
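
As a loose illustration of enabling chaos on only a subset of components, the sketch below picks one pod to delete per enabled component in each iteration. The selection policy, the function name `pick_victims`, and the pod names are assumptions made for this example and are not taken from chaosduck itself.

```python
import random


def pick_victims(pods_by_component, enabled, rng):
    """Pick one pod per enabled component; components without pods are skipped."""
    victims = []
    for component in sorted(enabled):
        pods = pods_by_component.get(component, [])
        if pods:
            victims.append(rng.choice(pods))
    return victims


# Hypothetical inventory: two replicas each for the HA components, while the
# IMC controller is present but not enabled for chaos.
pods_by_component = {
    "eventing-webhook": ["webhook-1", "webhook-2"],
    "eventing-controller": ["controller-1", "controller-2"],
    "imc-controller": ["imc-1"],
}
enabled = {"eventing-webhook", "eventing-controller"}

print(pick_victims(pods_by_component, enabled, random.Random(0)))
```

Keeping the enabled set explicit makes it straightforward to stub a component out now and switch it on later, once its issues are resolved.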
@mattmoor mattmoor changed the title [WIP] Run the chaosduck component during our e2e testing. Run the chaosduck component during our e2e testing. Jul 14, 2020
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 14, 2020
@mattmoor
Member Author

/hold

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 14, 2020
@mattmoor
Member Author

mattmoor commented Jul 15, 2020

/test pull-knative-eventing-integration-tests

(running again to flush out flakes) 😇

@mattmoor
Member Author

One more time...

/test pull-knative-eventing-integration-tests

@mattmoor
Member Author

This looks like the webhook shutdown failure I am chasing (here):

TestChannelNamespaceDefaulter/InMemoryChannel-messaging.knative.dev/v1: creation.go:79: Failed to create channel "e2e-defaulter-channel": Internal error occurred: failed calling webhook "webhook.eventing.knative.dev": Post https://eventing-webhook.knative-eventing-cjvz5x2e0p.svc:443/defaulting?timeout=2s: EOF

/retest

@knative-test-reporter-robot

The following jobs failed:

Test name Triggers Retries
pull-knative-eventing-integration-tests 0/3

Failed non-flaky tests preventing automatic retry of pull-knative-eventing-integration-tests:

test/e2e.TestDefaultBrokerWithManyTriggers
test/e2e.TestDefaultBrokerWithManyTriggers/test_default_broker_with_many_attribute_and_extension_triggers

@mattmoor
Member Author

Webhook again:

TestDefaultBrokerWithManyTriggers/test_default_broker_with_many_attribute_and_extension_triggers: creation.go:219: Failed to create v1beta1 trigger "trigger-testany-testany--extname1-extval1": Internal error occurred: failed calling webhook "validation.webhook.eventing.knative.dev": Post https://eventing-webhook.knative-eventing-kckqbzjwly.svc:443/resource-validation?timeout=2s: EOF

There is half a fix in (#3596), and I talked to @tcnghia about bumping network.DefaultDrainTimeout as well.

/retest
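
The idea behind a longer drain timeout can be shown with a toy model: once shutdown begins, the server reports "not ready" immediately but keeps accepting requests until the drain window has passed, so callers that have not yet noticed the endpoint change do not hit a closed connection. This is a sketch under stated assumptions, not Knative's implementation; `DrainingServer`, the injected clock, and the 30-second value are hypothetical.

```python
class DrainingServer:
    """Toy model of a server that drains before it stops accepting requests."""

    def __init__(self, drain_timeout, clock):
        self.drain_timeout = drain_timeout  # seconds to keep serving after shutdown begins
        self.clock = clock                  # callable returning the current time in seconds
        self.drain_started = None

    def begin_shutdown(self):
        # Record the moment draining starts; repeated calls do not reset it.
        if self.drain_started is None:
            self.drain_started = self.clock()

    def is_ready(self):
        # Readiness fails as soon as draining starts so new traffic is routed away.
        return self.drain_started is None

    def accepts_requests(self):
        # In-flight and late-arriving requests are still served during the drain window.
        if self.drain_started is None:
            return True
        return self.clock() - self.drain_started < self.drain_timeout


now = [0.0]  # fake clock, so the example does not need to sleep
server = DrainingServer(drain_timeout=30.0, clock=lambda: now[0])
server.begin_shutdown()
now[0] = 10.0
print(server.is_ready(), server.accepts_requests())
```

In this model a request arriving after the window closes fails, which is analogous to the EOF the API server reports when the webhook goes away too early.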

@vaikas
Contributor

vaikas commented Jul 15, 2020

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2020
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, vaikas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mattmoor
Member Author

/hold cancel

If we start seeing pervasive issues, we should roll this back, but the scope of this is intended to be a relatively stable subset to start flushing out more niche HA issues.

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2020
@knative-prow-robot knative-prow-robot merged commit 884ad13 into knative:master Jul 15, 2020
@mattmoor mattmoor deleted the chaosduck branch July 15, 2020 20:35

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test-and-release Test infrastructure, tests or release cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

6 participants