Bug 1752725: Log into kibana console get 504 Gateway Time-out The server didn't respond in time. when http_proxy enabled#255
Conversation
|
/retest |
|
/test e2e-operator |
I tried e.Meta, then it failed as follows...
clusterlogging_controller.go:61:73: e.Meta undefined (type event.UpdateEvent has no field or method Meta)
There is no e.Meta for UpdateEvent only MetaOld and MetaNew - see https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/event/event.go#L34
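To illustrate the point: in controller-runtime, `event.UpdateEvent` carries `MetaOld` and `MetaNew` rather than a single `Meta` field, so an update predicate compares the two. A minimal self-contained sketch (local types that mirror the real shapes, not the actual library):

```go
package main

import "fmt"

// Meta mimics the small subset of metav1.Object used here.
type Meta struct {
	Name            string
	ResourceVersion string
}

// UpdateEvent mirrors controller-runtime's event.UpdateEvent shape:
// there is no single Meta field, only MetaOld and MetaNew.
type UpdateEvent struct {
	MetaOld Meta
	MetaNew Meta
}

// changed reports whether the update actually modified the object --
// the kind of check a predicate's UpdateFunc would perform.
func changed(e UpdateEvent) bool {
	return e.MetaOld.ResourceVersion != e.MetaNew.ResourceVersion
}

func main() {
	e := UpdateEvent{
		MetaOld: Meta{Name: "cluster", ResourceVersion: "100"},
		MetaNew: Meta{Name: "cluster", ResourceVersion: "101"},
	}
	fmt.Println(changed(e))
}
```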
|
@nhosoi have you tested what the operator's processing looks like when you stack a proxy config change and a clusterlogging change? |
Thanks for your reviews, @ewolinetz. Well, what I could test was adding noProxy and/or httpProxy to the cluster proxy and check the fluentd env vars if they are applied. And removed them and check them again. httpsProxy and trustedCA are not tested. (not sure how to do so...) To be honest, with/without this PR, there's no difference in my test results... |
|
This patch really looks like it's breaking the e2e tests... :( But it's not clear to me how adding a watch causes these failures... |
|
@nhosoi a watch means that the operator looks for that object to change, and when it does it sends that request through the reconcile loop. So what may end up happening is we have multiple events that are being re-reconciled (so that we can periodically update our status). We may need to investigate a better way to update our status periodically without holding onto events ad infinitum |
|
/test e2e-operator |
I think it explains what I'm observing and wondering... Without the newly added watches, the cluster proxy config was consumed in the fluentd pod. Does that mean this cluster proxy event is reconciled by some other request and my addition is redundant??? I added 2 watches, one for the proxy configmap and the other for the proxy object itself. Let me disable them one by one and figure out what is causing this error... |
|
/test e2e-operator |
so one thing to note with doing this, i believe when we get to Reconcile any proxy config changes will push that event - so we may fail to get the clusterlogging instance.
I'm not sure if we can do an EnqueueRequestForOwner in this case to bypass that.
But in the case where we do get a proxy event change, we don't want to requeue it at the end of a successful run.
ok - so how do other operators deal with this? It seems that other operators that need to respect proxy settings would have to deal with this (unless once again logging is the pioneer that gets the arrows . . .)
I'm learning from cluster-network-operator. That operator has a separate reconciler for each watched target(?). In this PR, I piggybacked the proxyconfig watches on the clusterlogging reconciler... Do you think we have to have a separate reconciler as cluster-network-operator does???
https://github.com/openshift/cluster-network-operator/blob/master/pkg/controller/add_networkconfig.go
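The cluster-network-operator pattern referenced above registers one reconciler per watched target through a list of add functions. A self-contained sketch of that registration shape (the `Manager` type and function names here are stand-ins, not the real controller-runtime API):

```go
package main

import "fmt"

// Manager stands in for controller-runtime's manager.Manager.
type Manager struct{ controllers []string }

// AddToManagerFuncs mirrors the cluster-network-operator pattern:
// each watched target registers its own reconciler via its own
// add function, kept in one list.
var AddToManagerFuncs []func(*Manager) error

func addClusterLogging(m *Manager) error {
	m.controllers = append(m.controllers, "clusterlogging")
	return nil
}

func addProxyConfig(m *Manager) error {
	m.controllers = append(m.controllers, "proxyconfig")
	return nil
}

// AddToManager wires every registered controller into the manager.
func AddToManager(m *Manager) error {
	for _, f := range AddToManagerFuncs {
		if err := f(m); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	AddToManagerFuncs = append(AddToManagerFuncs, addClusterLogging, addProxyConfig)
	m := &Manager{}
	if err := AddToManager(m); err != nil {
		panic(err)
	}
	fmt.Println(m.controllers)
}
```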
And by commenting out the cluster configmap watch, the e2e test passed.
cluster configmap watch
just to clarify and prevent confusion, this isn't a configmap -- it's a non-namespaced object of type config
Do you think we have to have a separate reconciler as cluster-network-operator does?
I think this is the path we want to take as well, yes.
Thanks! I'm trying it and my first cut causes a strange problem. :) If I update cluster proxy spec/status, it restarts all the pods including elasticsearch and kibana... Obviously, there are lots more to learn... But I'm glad I got something to pursue.
|
Hi @ewolinetz, I'm stuck... Could you please help me? Regarding your advice [0], I tried what I could think of (some attempts are left in the patch, commented out...) but it looks like all of my attempts were invalid: updating the cluster proxy affects all pods managed by the cluster logging operator. Could you please give me some hints for "updating the proxy reconciler to just update the collector work"? [0]
What I'm observing is (the following is a snippet of the debug prints [1] I put into the patch [2]): Reconcile for 'instance' is called about every 30 sec. and updates the status of each pod. Of course, it does not restart the pods. When I update the cluster proxy's spec (in my testing, the noProxy value), then all the pods are restarted. It looks to me like that is caused by reconciling 'instance', since there's no k8shandler.Reconcile call in Reconcile in proxyconfig_controller.go... And one more thing I'm confused about: even without this PR/attempt - adding a Watch for the cluster proxy - changes made in the cluster proxy status are applied to the fluentd EnvVar. That is, it looks to me like we don't need the Watch for the cluster proxy and this PR is introducing something redundant. But I could be wrong... [1] [2] |
i think we should leave this function definition the same and instead have a separate call for the proxy config reconciler to just adjust the collector
|
/test e2e-operator |
|
Hi @bparees, do you think it's ok to squash the patches and lift /hold? Thanks! |
|
I haven't read through this yet, but I just floated openshift/enhancements#115 today. Does what you have here square with that? |
|
Thanks for your input, @wking.
In this conversation: https://github.com/openshift/enhancements/pull/115/files#r346660985 - I think this PR "uses it as the only set of CAs", i.e., we are overriding the existing |
you don't need to union it with your own system trusts because it already includes the UBI system trusts and i wouldn't expect your system trusts to be any different.
does logging have a user configurable source of CAs? If so you should be unioning with it. If not, then the question is moot. |
i'm wondering if we should actually move these from this controller to a different location like the types file since we use at least some of these values in multiple places...
cc @jcantrill
i'm wondering if we should actually move these from this controller to a different location like the types file since we use at least some of these values in multiple places...
cc @jcantrill
Where such types.go is supposed to locate? Something like one of these?
cluster-logging-operator/types/types.go
or
cluster-logging-operator/pkg/types/types.go
I created pkg/k8shandler/constants.go in this PR. I assume the file should be merged into the new "types.go", as well...
they are constants so maybe something more obvious like cluster-logging-operator/pkg/constants/constants.go Its own package structure may be the only way to avoid a cyclic import
This debug doesn't really get us much since we require the proxy name to be an expected one... if we keep this, can we change it to be something like "Reconciling for a proxy config event"?
I'm not sure this should be an error... also if the proxy config doesn't exist shouldn't we update our components to no longer have a proxy config?
I'm not sure this should be an error... also if the proxy config doesn't exist shouldn't we update our components to no longer have a proxy config?
Hmm, I thought that if the cluster is normal, the global proxy object always exists. But is there a case where it does not?
does it? I thought the proxy object is only there if any proxy configurations are specified during install.. @bparees can you comment on that?
the snippet below doesn't requeue, and in the case of an error i don't think we should... lets remove this comment
we shouldn't be requeueing...
Thanks for the confirmation! (I thought we had to from the conversation...) Let me fix it.
if we don't have a trust bundle anymore we should likely remove it from our components as well... (likely not an error message)
if we don't have a trust bundle anymore we should likely remove it from our components as well... (likely not an error message)
Currently, it is not removed when we see the trusted CA bundle is empty. We calculate the hash and set it to the daemonset/deployment annotation. When it's empty we keep the configmap with empty trusted CA bundle and set "0" to each annotation.
(please also note that it contains all the system trusted CAs, so I'm wondering if there's a chance...)
Apart from it, I agree with the IsNotFound is not an error. I'm removing the clause.
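The hash-annotation behavior described here can be sketched as follows. A self-contained sketch assuming sha256 as the hash (the actual hash function and annotation key used by the operator are not specified in this thread):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// caBundleHash returns the value stashed in the daemonset/deployment
// annotation: a hash of the trusted CA bundle contents, or "0" when
// the bundle is empty, as described above. The pods restart when this
// annotation value changes.
func caBundleHash(bundle string) string {
	if bundle == "" {
		return "0"
	}
	sum := sha256.Sum256([]byte(bundle))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(caBundleHash(""))
	fmt.Println(caBundleHash("-----BEGIN CERTIFICATE-----"))
}
```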
i think we do want to requeue here... if we fail during us trying to process something we should try to ensure it sticks
instead can we leverage the NewConfigMap function and then just add labels to the object metadata?
lets consolidate this into a single var -- no reason to have two different constants with the same value. fluentdTrustedCAName ?
update to be single const var?
shouldn't we already have this proxy object before in the call stack? It seems like we can save a k8s call by passing that object through...
shouldn't we already have this proxy object before in the call stack? It seems like we can save a k8s call by passing that object through...
You mean we should share the proxyConfig object (and the trustedCABundleCM) between kibana and fluentd? Is it possible to stash the object somewhere and refer to it from kibana and fluentd? If so, what is a good place to stash the objects? For instance, ClusterLoggingRequest?
i'm not sure we need to stash it, but can't we pass it along from the controller?
|
/retest |
|
@bparees were your requested changes addressed? |
it looks like this is trying to determine if the reconcile request was triggered by a change to a configmap that logging cares about, but i'm not sure how this works?
The above code is "Outdated".
Now we are checking that request.Name is in ReconcileForGlobalProxyList, which is {"fluentd-trusted-ca-bundle", "kibana-trusted-ca-bundle"}
} else if utils.ContainsString(constants.ReconcileForGlobalProxyList, request.Name) {
https://github.com/openshift/cluster-logging-operator/pull/255/files#diff-993a3a5d79e9b1a8103e2bde12087a84R108
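The check quoted above relies on a small membership helper. A self-contained sketch of `utils.ContainsString` and the constant list it is checked against (the helper's exact implementation in the PR may differ):

```go
package main

import "fmt"

// ReconcileForGlobalProxyList mirrors the constant slice referenced
// above: the trusted CA bundle configmap names that should trigger a
// proxy reconcile.
var ReconcileForGlobalProxyList = []string{
	"fluentd-trusted-ca-bundle",
	"kibana-trusted-ca-bundle",
}

// ContainsString is a sketch of the utils helper used in the check:
// it reports whether s appears in list.
func ContainsString(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(ContainsString(ReconcileForGlobalProxyList, "fluentd-trusted-ca-bundle"))
	fmt.Println(ContainsString(ReconcileForGlobalProxyList, "some-other-configmap"))
}
```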
is this watching all configmaps in the entire cluster? does operatorSDK not give us a way to scope the watch to the logging namespace?
I assume your question is about the ConfigMap at line 66 (not the Proxy). Now the ConfigMap is watched only if it is in the openshift-logging namespace and its name is fluentd-trusted-ca-bundle or kibana-trusted-ca-bundle.
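The scoping described above boils down to a two-part filter on incoming configmap events. A self-contained sketch of that filter logic (in the operator this would live in a watch predicate; the function name here is hypothetical):

```go
package main

import "fmt"

// watchedConfigMap reports whether a configmap event should be
// handled: it must be in the logging namespace and be one of the
// trusted CA bundle configmaps. A sketch of the scoping described
// above, not the operator's actual predicate.
func watchedConfigMap(namespace, name string) bool {
	if namespace != "openshift-logging" {
		return false
	}
	return name == "fluentd-trusted-ca-bundle" ||
		name == "kibana-trusted-ca-bundle"
}

func main() {
	fmt.Println(watchedConfigMap("openshift-logging", "kibana-trusted-ca-bundle"))
	fmt.Println(watchedConfigMap("default", "kibana-trusted-ca-bundle"))
	fmt.Println(watchedConfigMap("openshift-logging", "unrelated-cm"))
}
```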
|
@ewolinetz, @bparees, thank you very much for your reviews. |
|
/test e2e-operator |
|
Please review. We need to merge this by tomorrow, or it won't happen for a couple of weeks. Let's try to get this merged ASAP. |
|
@richm @nhosoi @ewolinetz my concerns around the event handling and deployment rollout triggering have been addressed.. lgtm. |
|
after https://github.com/openshift/cluster-logging-operator/pull/255/files#r348753401 i'll put a flag on this... @nhosoi |
Thanks, @ewolinetz, @bparees!! |
- Adding proxyconfig controller to watch the cluster proxy and trusted CA
bundle configmaps in the openshift-logging namespace. The configmap
names are KibanaTrustedCAName and FluentdTrustedCAName.
- Adding pkg/constants/constants.go to share the constant strings.
- Simplifying setting proxy environment variables to EnvVar.
- Adding trusted CA bundle configmap support.
The configmap is being watched in the proxyconfig controller.
Fluentd daemonset and kibana deployment hold the hash value of the ca
certs in their annotations. The value is updated if the ca certs
in the configmap are updated, which triggers the fluentd and kibana
pods to restart and update the mounted tls-ca-bundle.pem file.
It overrides /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem with
the certs auto-filled in the configmap by volume-mounting it.
utils.EnvVarEqual:
- In EnvVarSourceEqual, replacing reflect.DeepEqual with customized
EnvVarResourceFieldSelectorEqual since Divisor (type resource.Quantity)
is not correctly compared by DeepEqual.
Others:
hack/common - Keeping debug_print for future debugging.
This PR fixes the following 3 bugs.
Bug 1752725 - Log into kibana console get `504 Gateway Time-out The
server didn't respond in time. ` when http_proxy enabled
Bug 1766187 - Authentication "500 Internal Error"'
Bug 1768762 - Fluentd: "Could not communicate to Elasticsearch" when
http proxy enabled in the cluster.
Fix: Setting the elasticsearch FQDN to logStoreService
and elasticsearchName. The FQDN belongs to the global
proxy noProxy list. By doing so, it skips the global
proxy to communicate with the internal elasticsearch.
Bug 1774837 - Too many `warning: The environment variable HTTP_PROXY is
discouraged. Use http_proxy.` in fluentd pod logs after
enable forwarding logs to user-managed ES as insecure
Fix: In addition to HTTP_PROXY, HTTPS_PROXY and NO_PROXY,
setting http_proxy, https_proxy and no_proxy, as well.
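The fix for Bug 1774837 amounts to emitting each proxy setting in both upper- and lower-case forms. A self-contained sketch (the `EnvVar` type here mimics the corev1.EnvVar shape; the helper name and ordering are assumptions, not the PR's actual code):

```go
package main

import "fmt"

// EnvVar mimics the corev1.EnvVar shape used by the operator.
type EnvVar struct {
	Name  string
	Value string
}

// proxyEnvVars sets both upper- and lower-case variants: fluentd's
// ruby stack warns that HTTP_PROXY is discouraged and prefers
// http_proxy, so both are provided. A sketch of the fix described
// above.
func proxyEnvVars(httpProxy, httpsProxy, noProxy string) []EnvVar {
	return []EnvVar{
		{Name: "HTTP_PROXY", Value: httpProxy},
		{Name: "http_proxy", Value: httpProxy},
		{Name: "HTTPS_PROXY", Value: httpsProxy},
		{Name: "https_proxy", Value: httpsProxy},
		{Name: "NO_PROXY", Value: noProxy},
		{Name: "no_proxy", Value: noProxy},
	}
}

func main() {
	for _, ev := range proxyEnvVars("http://proxy:3128", "https://proxy:3128", ".svc,.cluster.local") {
		fmt.Printf("%s=%s\n", ev.Name, ev.Value)
	}
}
```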
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ewolinetz, nhosoi. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@nhosoi: All pull requests linked via external trackers have merged. Bugzilla bug 1752725 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
clusterlogging_controller.go - Adding watch for cluster proxy
Borrowed the code from cluster-network-operator/pkg/controller/proxyconfig