Bug 1752725: Log into kibana console get 504 Gateway Time-out The server didn't respond in time. when http_proxy enabled#255
Conversation
|
/retest |
|
/test e2e-operator |
I tried e.Meta, then it failed as follows...
clusterlogging_controller.go:61:73: e.Meta undefined (type event.UpdateEvent has no field or method Meta)
There is no e.Meta for UpdateEvent only MetaOld and MetaNew - see https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/event/event.go#L34
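To illustrate the point: in controller-runtime, `event.UpdateEvent` carries `MetaOld` and `MetaNew` rather than a single `Meta` field, so an update predicate compares the two. A minimal self-contained sketch (local types that mirror the real shapes, not the actual library):

```go
package main

import "fmt"

// Meta mimics the small subset of metav1.Object used here.
type Meta struct {
	Name            string
	ResourceVersion string
}

// UpdateEvent mirrors controller-runtime's event.UpdateEvent shape:
// there is no single Meta field, only MetaOld and MetaNew.
type UpdateEvent struct {
	MetaOld Meta
	MetaNew Meta
}

// changed reports whether the update actually modified the object --
// the kind of check a predicate's UpdateFunc would perform.
func changed(e UpdateEvent) bool {
	return e.MetaOld.ResourceVersion != e.MetaNew.ResourceVersion
}

func main() {
	e := UpdateEvent{
		MetaOld: Meta{Name: "cluster", ResourceVersion: "100"},
		MetaNew: Meta{Name: "cluster", ResourceVersion: "101"},
	}
	fmt.Println(changed(e))
}
```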
|
@nhosoi have you tested what the operator's processing looks like when you stack a proxy config change and a clusterlogging change? |
Thanks for your reviews, @ewolinetz. Well, what I could test was adding noProxy and/or httpProxy to the cluster proxy and check the fluentd env vars if they are applied. And removed them and check them again. httpsProxy and trustedCA are not tested. (not sure how to do so...) To be honest, with/without this PR, there's no difference in my test results... |
|
This patch really looks like it's breaking the e2e tests... :( But it's not clear to me how adding a watch causes these failures... |
|
@nhosoi a watch means that the operator looks for that object to change, and when it does it sends that request through the reconcile loop. So what may end up happening is we have multiple events that are being re-reconciled (so that we can periodically update our status). We may need to investigate a better way to update our status periodically without holding onto events ad infinitum |
|
/test e2e-operator |
I think it explains what I'm observing and wondering... Without the newly added watches, the cluster proxy config was consumed in the fluentd pod. Does that mean this cluster proxy event is reconciled by some other request and my addition is redundant??? I added 2 watches, one for the proxy configmap and the other for the proxy object itself. Let me disable them one by one and figure out what is causing this error... |
|
/test e2e-operator |
so one thing to note with doing this, i believe when we get to Reconcile any proxy config changes will push that event - so we may fail to get the clusterlogging instance.
I'm not sure if we can do an EnqueueRequestForOwner in this case to bypass that.
But in the case where we do get a proxy event change, we don't want to requeue it at the end of a successful run.
ok - so how do other operators deal with this? It seems that other operators that need to respect proxy settings would have to deal with this (unless once again logging is the pioneer that gets the arrows . . .)
I'm learning from cluster-network-operator. That operator has a separate reconciler for each watched target(?). In this PR, I piggybacked the proxyconfig watches on the clusterlogging reconciler... Do you think we have to have a separate reconciler as cluster-network-operator does???
https://github.com/openshift/cluster-network-operator/blob/master/pkg/controller/add_networkconfig.go
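The cluster-network-operator pattern referenced above registers one reconciler per watched target through a list of add functions. A self-contained sketch of that registration shape (the `Manager` type and function names here are stand-ins, not the real controller-runtime API):

```go
package main

import "fmt"

// Manager stands in for controller-runtime's manager.Manager.
type Manager struct{ controllers []string }

// AddToManagerFuncs mirrors the cluster-network-operator pattern:
// each watched target registers its own reconciler via its own
// add function, kept in one list.
var AddToManagerFuncs []func(*Manager) error

func addClusterLogging(m *Manager) error {
	m.controllers = append(m.controllers, "clusterlogging")
	return nil
}

func addProxyConfig(m *Manager) error {
	m.controllers = append(m.controllers, "proxyconfig")
	return nil
}

// AddToManager wires every registered controller into the manager.
func AddToManager(m *Manager) error {
	for _, f := range AddToManagerFuncs {
		if err := f(m); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	AddToManagerFuncs = append(AddToManagerFuncs, addClusterLogging, addProxyConfig)
	m := &Manager{}
	if err := AddToManager(m); err != nil {
		panic(err)
	}
	fmt.Println(m.controllers)
}
```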
And by commenting out the cluster configmap watch, the e2e test passed.
cluster configmap watch
just to clarify and prevent confusion, this isn't a configmap -- it's a non-namespaced object of type config
Do you think we have to have a separate reconciler as cluster-network-operator does?
I think this is the path we want to take as well, yes.
Thanks! I'm trying it and my first cut causes a strange problem. :) If I update cluster proxy spec/status, it restarts all the pods including elasticsearch and kibana... Obviously, there are lots more to learn... But I'm glad I got something to pursue.
|
Hi @ewolinetz, I'm stuck... Could you please help me? Regarding your advice [0], I tried what I could think of (some attempts are left in the patch, commented out...) but it looks like all of my attempts were invalid: updating the cluster proxy affects all pods managed by the cluster logging operator. Could you please give me some hints for "updating the proxy reconciler to just update the collector work"? [0]
What I'm observing is (the following is a snippet of the debug prints [1] I put into the patch [2]): Reconcile for 'instance' is called about every 30 sec. and updates the status of each pod. Of course, it does not restart the pods. When I update the cluster proxy's spec (in my testing, the noProxy value), then all the pods are restarted. It looks to me like that is caused by reconciling 'instance', since there's no k8shandler.Reconcile call in Reconcile in proxyconfig_controller.go... And one more thing I'm confused about: even without this PR/attempt - adding a Watch for the cluster proxy - changes made in the cluster proxy status are applied to the fluentd EnvVar. That is, it looks to me like we don't need the Watch for the cluster proxy and this PR is introducing something redundant. But I could be wrong... [1] [2] |
i think we should leave this function definition the same and instead have a separate call for the proxy config reconciler to just adjust the collector
|
/test e2e-operator |
|
Hi @bparees, do you think it's ok to squash the patches and lift /hold? Thanks! |
|
I haven't read through this yet, but I just floated openshift/enhancements#115 today. Does what you have here square with that? |
|
Thanks for your input, @wking.
In this conversation: https://github.com/openshift/enhancements/pull/115/files#r346660985 - I think this PR "uses it as the only set of CAs", i.e., we are overriding the existing |
you don't need to union it with your own system trusts because it already includes the UBI system trusts and i wouldn't expect your system trusts to be any different.
does logging have a user configurable source of CAs? If so you should be unioning with it. If not, then the question is moot. |
i'm wondering if we should actually move these from this controller to a different location like the types file since we use at least some of these values in multiple places...
cc @jcantrill
i'm wondering if we should actually move these from this controller to a different location like the types file since we use at least some of these values in multiple places...
cc @jcantrill
Where such types.go is supposed to locate? Something like one of these?
cluster-logging-operator/types/types.go
or
cluster-logging-operator/pkg/types/types.go
I created pkg/k8shandler/constants.go in this PR. I assume the file should be merged into the new "types.go", as well...
they are constants so maybe something more obvious like cluster-logging-operator/pkg/constants/constants.go Its own package structure may be the only way to avoid a cyclic import
This debug doesn't really get us much since we require the proxy name to be an expected one... if we keep this, can we change it to be something like "Reconciling for a proxy config event"?
I'm not sure this should be an error... also if the proxy config doesn't exist shouldn't we update our components to no longer have a proxy config?
I'm not sure this should be an error... also if the proxy config doesn't exist shouldn't we update our components to no longer have a proxy config?
Hmm, I thought that if the cluster is normal, the global proxy object always exists. But is there a case where it does not?
does it? I thought the proxy object is only there if any proxy configurations are specified during install.. @bparees can you comment on that?
the snippet below doesn't requeue, and in the case of an error i don't think we should... lets remove this comment
we shouldn't be requeueing...
Thanks for the confirmation! (I thought we had to from the conversation...) Let me fix it.
if we don't have a trust bundle anymore we should likely remove it from our components as well... (likely not an error message)
if we don't have a trust bundle anymore we should likely remove it from our components as well... (likely not an error message)
Currently, it is not removed when we see the trusted CA bundle is empty. We calculate the hash and set it to the daemonset/deployment annotation. When it's empty we keep the configmap with empty trusted CA bundle and set "0" to each annotation.
(please also note that it contains all the system trusted CAs, so I'm wondering if there's a chance...)
Apart from it, I agree with the IsNotFound is not an error. I'm removing the clause.
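The hash-annotation behavior described here can be sketched as follows. A self-contained sketch assuming sha256 as the hash (the actual hash function and annotation key used by the operator are not specified in this thread):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// caBundleHash returns the value stashed in the daemonset/deployment
// annotation: a hash of the trusted CA bundle contents, or "0" when
// the bundle is empty, as described above. The pods restart when this
// annotation value changes.
func caBundleHash(bundle string) string {
	if bundle == "" {
		return "0"
	}
	sum := sha256.Sum256([]byte(bundle))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(caBundleHash(""))
	fmt.Println(caBundleHash("-----BEGIN CERTIFICATE-----"))
}
```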
i think we do want to requeue here... if we fail during us trying to process something we should try to ensure it sticks
instead can we leverage the NewConfigMap function and then just add labels to the object metadata?
lets consolidate this into a single var -- no reason to have two different constants with the same value. fluentdTrustedCAName ?
update to be single const var?
shouldn't we already have this proxy object before in the call stack? It seems like we can save a k8s call by passing that object through...
shouldn't we already have this proxy object before in the call stack? It seems like we can save a k8s call by passing that object through...
You mean we should share the proxyConfig object (and the trustedCABundleCM) between kibana and fluentd? Is it possible to stash the object somewhere and refer to it from kibana and fluentd? If so, what is a good place to stash the objects? For instance, ClusterLoggingRequest?
i'm not sure we need to stash it, but can't we pass it along from the controller?
|
/retest |
|
@bparees were your requested changes addressed? |
it looks like this is trying to determine if the reconcile request was triggered by a change to a configmap that logging cares about, but i'm not sure how this works?
The above code is "Outdated".
Now we are checking that request.Name is in ReconcileForGlobalProxyList, which is {"fluentd-trusted-ca-bundle", "kibana-trusted-ca-bundle"}
} else if utils.ContainsString(constants.ReconcileForGlobalProxyList, request.Name) {
https://github.com/openshift/cluster-logging-operator/pull/255/files#diff-993a3a5d79e9b1a8103e2bde12087a84R108
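The check quoted above relies on a small membership helper. A self-contained sketch of `utils.ContainsString` and the constant list it is checked against (the helper's exact implementation in the PR may differ):

```go
package main

import "fmt"

// ReconcileForGlobalProxyList mirrors the constant slice referenced
// above: the trusted CA bundle configmap names that should trigger a
// proxy reconcile.
var ReconcileForGlobalProxyList = []string{
	"fluentd-trusted-ca-bundle",
	"kibana-trusted-ca-bundle",
}

// ContainsString is a sketch of the utils helper used in the check:
// it reports whether s appears in list.
func ContainsString(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(ContainsString(ReconcileForGlobalProxyList, "fluentd-trusted-ca-bundle"))
	fmt.Println(ContainsString(ReconcileForGlobalProxyList, "some-other-configmap"))
}
```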
is this watching all configmaps in the entire cluster? does operatorSDK not give us a way to scope the watch to the logging namespace?
I assume your question is about the ConfigMap at line 66 (not the Proxy). Now the ConfigMap is watched only if it is in the openshift-logging namespace and its name is fluentd-trusted-ca-bundle or kibana-trusted-ca-bundle.
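The scoping described above boils down to a two-part filter on incoming configmap events. A self-contained sketch of that filter logic (in the operator this would live in a watch predicate; the function name here is hypothetical):

```go
package main

import "fmt"

// watchedConfigMap reports whether a configmap event should be
// handled: it must be in the logging namespace and be one of the
// trusted CA bundle configmaps. A sketch of the scoping described
// above, not the operator's actual predicate.
func watchedConfigMap(namespace, name string) bool {
	if namespace != "openshift-logging" {
		return false
	}
	return name == "fluentd-trusted-ca-bundle" ||
		name == "kibana-trusted-ca-bundle"
}

func main() {
	fmt.Println(watchedConfigMap("openshift-logging", "kibana-trusted-ca-bundle"))
	fmt.Println(watchedConfigMap("default", "kibana-trusted-ca-bundle"))
	fmt.Println(watchedConfigMap("openshift-logging", "unrelated-cm"))
}
```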
|
@ewolinetz, @bparees, thank you very much for your reviews. |
|
/test e2e-operator |
|
Please review. We need to merge this by tomorrow, or it won't happen for a couple of weeks. Let's try to get this merged ASAP. |
|
@richm @nhosoi @ewolinetz my concerns around the event handling and deployment rollout triggering have been addressed.. lgtm. |
|
after https://github.com/openshift/cluster-logging-operator/pull/255/files#r348753401 i'll put a flag on this... @nhosoi |
Thanks, @ewolinetz, @bparees!! |
- Adding proxyconfig controller to watch the cluster proxy and trusted CA
bundle configmaps in the openshift-logging namespace. The configmap
names are KibanaTrustedCAName and FluentdTrustedCAName.
- Adding pkg/constants/constants.go to share the constant strings.
- Simplifying setting proxy environment variables to EnvVar.
- Adding trusted CA bundle configmap support.
The configmap is being watched in the proxyconfig controller.
Fluentd daemonset and kibana deployment hold the hash value of the ca
certs in their annotations. The value is updated if the ca certs
in the configmap are updated, which triggers the fluentd and kibana
pods to restart and update the mounted tls-ca-bundle.pem file.
It overrides /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem with
the certs auto-filled in the configmap by volume-mounting it.
utils.EnvVarEqual:
- In EnvVarSourceEqual, replacing reflect.DeepEqual with customized
EnvVarResourceFieldSelectorEqual since Divisor (type resource.Quantity)
is not correctly compared by DeepEqual.
Others:
hack/common - Keeping debug_print for future debugging.
This PR fixes the following 3 bugs.
Bug 1752725 - Log into kibana console get `504 Gateway Time-out The
server didn't respond in time. ` when http_proxy enabled
Bug 1766187 - Authentication "500 Internal Error"'
Bug 1768762 - Fluentd: "Could not communicate to Elasticsearch" when
http proxy enabled in the cluster.
Fix: Setting the elasticsearch FQDN to logStoreService
and elasticsearchName. The FQDN belongs to the global
proxy noProxy list. By doing so, it skips the global
proxy to communicate with the internal elasticsearch.
Bug 1774837 - Too many `warning: The environment variable HTTP_PROXY is
discouraged. Use http_proxy.` in fluentd pod logs after
enable forwarding logs to user-managed ES as insecure
Fix: In addition to HTTP_PROXY, HTTPS_PROXY and NO_PROXY,
setting http_proxy, https_proxy and no_proxy, as well.
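The fix for Bug 1774837 amounts to emitting each proxy setting in both upper- and lower-case forms. A self-contained sketch (the `EnvVar` type here mimics the corev1.EnvVar shape; the helper name and ordering are assumptions, not the PR's actual code):

```go
package main

import "fmt"

// EnvVar mimics the corev1.EnvVar shape used by the operator.
type EnvVar struct {
	Name  string
	Value string
}

// proxyEnvVars sets both upper- and lower-case variants: fluentd's
// ruby stack warns that HTTP_PROXY is discouraged and prefers
// http_proxy, so both are provided. A sketch of the fix described
// above.
func proxyEnvVars(httpProxy, httpsProxy, noProxy string) []EnvVar {
	return []EnvVar{
		{Name: "HTTP_PROXY", Value: httpProxy},
		{Name: "http_proxy", Value: httpProxy},
		{Name: "HTTPS_PROXY", Value: httpsProxy},
		{Name: "https_proxy", Value: httpsProxy},
		{Name: "NO_PROXY", Value: noProxy},
		{Name: "no_proxy", Value: noProxy},
	}
}

func main() {
	for _, ev := range proxyEnvVars("http://proxy:3128", "https://proxy:3128", ".svc,.cluster.local") {
		fmt.Printf("%s=%s\n", ev.Name, ev.Value)
	}
}
```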
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ewolinetz, nhosoi. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@nhosoi: All pull requests linked via external trackers have merged. Bugzilla bug 1752725 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
clusterlogging_controller.go - Adding watch for cluster proxy
Borrowed the code from cluster-network-operator/pkg/controller/proxyconfig