Structured logging for route controller #888
google-prow-robot merged 19 commits into knative:master from mdemirhan:zap
Conversation
…onfiguration in case of bad configurations.
/retest
| if err := json.Unmarshal([]byte(zapJSON), &loggingCfg); err != nil { | ||
| // Failed to parse the logging configuration. Fall back to production config | ||
| loggingCfg = zap.NewProductionConfig() | ||
| logInitFailure = fmt.Sprintf("Error parsing logging configuration. Using default config: %v", err) |
This is a total nit, but the way I'd read this log message would be that the default config would be printed after the "Using default config: " but we print the error there. I can't think off the top of my head how to make it more readable, so unless somebody can make a suggestion, it's fine.
I will try to clarify this. Thanks for pointing this out!
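For illustration, one way to reorder the message so the error no longer reads as the default config (wording is only a suggestion, using the same logInitFailure variable from the diff above):

logInitFailure = fmt.Sprintf("Failed to parse the logging configuration; falling back to the default production config. Parse error: %v", err)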
| logger.Infof("Using fluentd sidecar image: %s", loggingFluentSidecarImage) | ||
| } else { | ||
| glog.Fatal("missing required flag: -fluentdSidecarImage") | ||
| logger.Fatalf("missing required flag: -fluentdSidecarImage") |
| } | ||
| func (c *Controller) reconcileIngress(route *v1alpha1.Route) error { | ||
| func (c *Controller) reconcileIngress(route *v1alpha1.Route, logger *zap.SugaredLogger) error { |
In some CF codebases there's a pattern used for tracing/debugging that looks like:
func (c *Controller) reconcileIngress(logger *zap.SugaredLogger, ...) error {
logger = logger.Named("reconcileIngress") // this hints at the call path
logger.Debug("start")
defer logger.Debug("end")
...
}
If loggers are named properly you'll more or less have a call path, which is especially useful if you don't have file line numbers. The debug log statements help detect any long blocking calls/deadlocks without needing to instrument signal handling to dump the list of goroutines.
I like this pattern quite a bit. We have a best practices document and I will add it there. I am not an expert in the controller codebase and would be hesitant to add my own instrumentation here; I would rather have the code owners make such changes.
| func (c *Controller) syncTrafficTargetsAndUpdateRouteStatus(route *v1alpha1.Route) (*v1alpha1.Route, error) { | ||
| c.consolidateTrafficTargets(route) | ||
| configMap, revMap, err := c.getDirectTrafficTargets(route) | ||
| func (c *Controller) syncTrafficTargetsAndUpdateRouteStatus(route *v1alpha1.Route, logger *zap.SugaredLogger) (*v1alpha1.Route, error) { |
Thoughts on having the logger always be the first argument?
@grantr and @jonjohnsonjr WDYT? You are the main owners of these files, and if it doesn't bother anyone, I can move logger to be the first argument.
I agree with @dprotaso - I think I'd prefer this to be somehow embedded in a context.Context and made the first argument. It seems like we'll want to add context to everything, anyway, and this fits that pattern pretty well...
If we're adding arguments everywhere, I'd also prefer to use a context.Context, but this makes me wonder: Istio already uses zap; how does it deal with log context?
@dprotaso @jonjohnsonjr @grantr - I converted from passing logger around to passing context around as much as possible. Please take a look and let me know how this one looks.
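For readers following the thread, a minimal sketch of the resulting context-based shape, assuming the logging.WithLogger/FromContext helpers that appear later in this diff (the body is illustrative):

func (c *Controller) reconcileIngress(ctx context.Context, route *v1alpha1.Route) error {
	// The logger now travels in the context instead of as an explicit argument.
	logger := logging.FromContext(ctx)
	logger.Info("Reconciling ingress")
	// ... actual reconciliation elided ...
	return nil
}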
| logger.Info("Creating/Updating placeholder k8s services") | ||
| if err = c.reconcilePlaceholderService(route); err != nil { | ||
| if err = c.reconcilePlaceholderService(route, logger); err != nil { |
Curious if there are any external calls which would benefit from having a context.Context with a timeout. If so, passing around a context with the logger nested as a value would change these function definitions.
See my discussion here: #347 (comment)
There are external calls to k8s APIs, but as far as I know they don't take a context as an input.
| func main() { | ||
| flag.Parse() | ||
| zapJSON := loggingZapCfg.Get() |
This is a lot of boilerplate -- is it possible to move it into pkg/logging?
| // We will use sugared logger over the raw logger because it provides a | ||
| // better balance between performance and ease of use. | ||
| // For performance critical parts, one can call into "Desugar" method and |
A bit unrelated, but I'm skeptical that we have any performance critical parts that would be materially affected by logging perf.
I agree. It can be useful in the data plane, but I don't expect any of our control plane code needing this. That being said, I wanted to add the comment just in case we need it in the future for some reason.
Performance critical situations:
- Scaling up or down. Any controller in the scale path must be low latency.
- High revision cardinality. When informers have thousands of revisions to keep track of, memory pressure and GC pauses may be more of an issue.
The sugared logger is probably fine for other situations. This isn't intended to be a recommendation for this PR, but I prefer the unsugared logging interface based on personal experience with Golang structured logging:
- It preserves type-safety at the expense of slightly greater verbosity. Several logging bugs were caught by the compiler that would have been runtime bugs otherwise.
- Debuggability suffers when developers are afraid to log for fear of impacting performance. The implementation of leveled logging in particular should be as close to zero cost as possible for logs below the current level.
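As a reference for the near-zero-cost point, a sketch of what zap's unsugared API allows (assuming logger is the *zap.SugaredLogger used in this PR; revName is illustrative). Check returns nil when the level is disabled, so the fields are never built in that case:

if ce := logger.Desugar().Check(zap.DebugLevel, "synced revision"); ce != nil {
	ce.Write(zap.String("revision", revName))
}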
| // Start the informer factories to begin populating the informer caches | ||
| glog.Info("Starting Route controller") | ||
| c.logger.Info("Starting Route controller") |
It seems strange to access the logger like this in some places and via the context in others. I think ideally we would pass ctx through everything, but IDK if it's worth worrying about for this PR.
We cannot change the signatures of methods called by k8s (this method plus addConfigurationEvent, updateConfigurationEvent and updateIngressEvent). That is why these places start with the controller-wide logger and then rebuild their logger with the proper route name and namespace. A bit unfortunate, but I couldn't come up with something that looks better.
Ah, I see. My mistake.
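A sketch of the rebuild pattern described above; the handler body is illustrative, and the interface{} signature is dictated by the informer API:

func (c *Controller) updateIngressEvent(old, new interface{}) {
	ingress := new.(*v1beta1.Ingress)
	// The signature is fixed by k8s, so derive a scoped logger from the
	// controller-wide one using the route's namespace and name.
	logger := loggerWithRouteInfo(c.logger, ingress.Namespace, ingress.Name)
	logger.Info("Ingress updated")
}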
| } | ||
| logger := loggerWithRouteInfo(c.logger, namespace, name) | ||
| ctx := logging.WithLogger(context.Background(), logger) |
Perhaps context.TODO() instead? (in most places)
What does this mean - is the output really noisy?
@jonjohnsonjr The output is going to be as noisy as it is today - no change there. We need to clean up a lot of our logging, which seems either useless or too verbose :) The reason I said kubectl will suck for this is the shape of the output: by default, it will encode the logs as JSON - not great for human consumption. For development though, one can change the config and set the encoding to "console" rather than "JSON" to get more readable output, but it will still encode key-value pairs as JSON.
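For local development, the console switch amounts to a one-line config change; a sketch using zap's standard config API:

cfg := zap.NewProductionConfig()
cfg.Encoding = "console" // human-readable lines instead of JSON objects
logger, err := cfg.Build()
if err != nil {
	panic(err)
}
logger.Sugar().Info("now readable in a terminal")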
@dprotaso @jonjohnsonjr @grantr @vaikas-google Any concerns with this PR that you think are still unresolved? If not, I would like to check this in and start converting the rest of the components to use this model.
# Conflicts: # pkg/controller/route/route.go
grantr
left a comment
Great work @mdemirhan! This is a nice improvement. I have a few recommendations and one requested change (returning a noop logger instead of nil from FromContext).
I noticed most of the logging calls here still use interpolated messages instead of structured logging. Do you intend to convert those in a later PR?
Context might seem cumbersome now, but I think it will pay dividends in the future. Examples of other ways it could be useful:
- Set a deadline on a reconcile iteration.
- Propagate opencensus tags to reconcile methods, allowing things like identifying which routes take longer to reconcile.
- Reconcile tracing. I don't know if this is possible, but what if we could generate a trace based on the generation or resourceVersion of a resource that shows everything that happened when that resource was updated?
| "errorOutputPaths": ["stderr"], | ||
| "encoding": "json", | ||
| "encoderConfig": { | ||
| "timeKey": "", |
Why is timeKey empty? Do we not log timestamps?
Kubernetes logging driver already adds a timestamp for logs collected from stdout and stderr. That is why we are omitting it here and preventing dual timestamps in the logs.
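Concretely, the Go equivalent of the empty timeKey above would be a sketch like:

cfg := zap.NewProductionConfig()
// The k8s logging driver already timestamps stdout/stderr, so drop zap's own.
cfg.EncoderConfig.TimeKey = ""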
| func main() { | ||
| flag.Parse() | ||
| logger := logging.NewLogger(loggingZapCfg.Get()) |
Possible enhancement: name this logger with the name of the binary:
logger := logging.NewLogger(loggingZapCfg.Get()).Named("ela-controller")

That gives the user a way to search for all messages generated by this binary (which includes multiple controllers).
| glog.Infof("Route controller Init") | ||
| // Enrich the logs with controller type set to route | ||
| logger = logger.With(zap.String(logging.LogKeyControllerType, "route")) |
Possible enhancement: Also add route to the logger name, for example:
logger = logger.Named("route").With(zap.String(logging.LogKeyControllerType, "route"))

| return err | ||
| } | ||
| glog.Infof("Created service: %q", service.Name) | ||
| logging.FromContext(ctx).Infof("Created service %s", service.Name) |
Why not set logger to a variable in this method, since there are two log calls?
Didn't want to increase the code size and I had a random threshold of 3 logger calls to assign a variable :) I will change these.
| logger := rawLogger.Sugar() | ||
| if len(logInitFailure) > 0 { | ||
| logger.Error(logInitFailure) | ||
| } |
Another way to structure this would be to split the config parsing into a separate function that returns (zap.Config, error). Then there's no need to check for emptiness of the logInitFailure string.
func NewLogger(configJSON string) *zap.SugaredLogger {
	loggingCfg, parseErr := parseConfig(configJSON)
	if parseErr != nil {
		loggingCfg = zap.NewProductionConfig()
	}
	logger, err := loggingCfg.Build()
	if err != nil {
		panic(err) // should not happen with the default config; surfaces bad user configs early
	}
	if parseErr != nil {
		logger.Error("Failed to parse the logging config. Will use default config.", zap.Error(parseErr))
	}
	return logger.Sugar()
}
Good suggestion. I will fix that.
| limitations under the License. | ||
| */ | ||
| package logging |
What do you think about moving these to the elafros/logging/logkey package? That gives the constant names a cleaner namespace. I think brevity will become more important as we update the logger calls to emit fields instead of interpolated messages.
// this
c.logger.Error("Error fetching route upon ingress creation",
zap.Error(err), zap.String(logging.LogKeyNamespace, ns), zap.String(logging.LogKeyRoute, routeName))
// versus this
c.logger.Error("Error fetching route upon ingress creation",
zap.Error(err), zap.String(logkey.Namespace, ns), zap.String(logkey.Route, routeName))
Makes sense. I will move to logkey package.
| "go.uber.org/zap" | ||
| ) | ||
| type loggerKey struct{} |
Thanks for doing it this way! Using a string as key is a common mistake with context values.
| if logger, ok := ctx.Value(loggerKey{}).(*zap.SugaredLogger); ok { | ||
| return logger | ||
| } | ||
| return nil |
This should return a no-op logger (zap.NewNop().Sugar()). Otherwise, if the logger is missing from the context, callers will panic when they try to call logger methods on nil.
I was doing this intentionally so that invalid contexts are caught during development. I will change this to what you describe though.
That makes sense! What about this compromise: instead of silently returning a noop logger that will lose potentially valuable log messages, this package can create a package-specific fallback logger, named as such, and return that when logger isn't present. That way no log messages are lost due to incorrect contexts, but it's still detectable when that happens (because messages are tagged with logger=fallback).
Sounds great. I will send an update shortly.
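A sketch of the compromise as described (names are illustrative; the actual update may differ):

// Package-level fallback, named so its use is detectable in log output.
var fallbackLogger = zap.NewNop().Sugar()

func init() {
	if logger, err := zap.NewProduction(); err == nil {
		fallbackLogger = logger.Named("fallback").Sugar()
	}
}

// FromContext returns the logger stored in ctx, or the named fallback
// logger so that no log messages are silently dropped.
func FromContext(ctx context.Context) *zap.SugaredLogger {
	if logger, ok := ctx.Value(loggerKey{}).(*zap.SugaredLogger); ok {
		return logger
	}
	return fallbackLogger
}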
| if got := FromContext(ctx); want != got { | ||
| t.Errorf("unexpected logger in context. want: %v, got: %v", want, got) | ||
| } | ||
| } |
Add a test for a context that doesn't have a logger value.
Seems like there is already a test covering this. Line 40.
@grantr Thanks for the detailed review.

Regarding interpolated messages: I was planning to open GitHub issues and assign them to the owners of the code to handle fully structuring the logs. My change only adds the basic structure (controller name, route name, namespace). I myself do not have the full context on all the code and might not make the right call. Also, on fully structured vs interpolated messages: I think we need a balance between both. Not everything we put in the messages should be a first-class field; I think the code owner will make the best call here.

Regarding sugared vs raw logger: as a compromise, I can add both sugared and raw loggers to the context, change FromContext to SugaredFromContext() and RawFromContext(), and force developers to explicitly pick one. My concern with using the raw logger as the default is that people will put every single interpolated message into a first-class field, which might not be desirable, especially with log indexing systems. WDYT?
👍 That makes sense.
We should have guidance on how to choose between them. I propose the following:

- If a field can act as a filter key to group messages, it should be a structured log field.
- If a field's searchability will lead to confusion (for example, if it tends to be parsed incorrectly or is easily confused with a different field), then it should not be a structured log field.
- If a field's key is dynamic with high cardinality, then it should be interpolated (though in this case, the dynamic key should likely be a value itself).

Elasticsearch works fairly well with indices having thousands of fields, so I wouldn't worry about anything from that perspective, especially since we control all the code generating controller logs.
I don't think that's necessary. Specific cases needing the raw logger can call Desugar().
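To illustrate the balance being discussed, the sugared logger supports both styles side by side (field names illustrative):

// Structured: "service" becomes a filterable key in the logging backend.
logger.Infow("Created service", "service", service.Name)
// Interpolated: fine for values that wouldn't make useful filter keys.
logger.Infof("Created service %s", service.Name)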
…e one. Addressed the rest of PR feedback.
Oops, forgot the /lgtm

And that it needs to be on its own line

/lgtm

I don't think I'm a valid approver, but just in case

/assign @mattmoor @evankanderson
@mattmoor @evankanderson, can one of you please approve? Grant can't approve. Merging the branch removed the LGTM label, so I will need an LGTM label as well.

/retest

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: grantr, mattmoor, mdemirhan

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
This changes the route controller to use a structured logging library (zap). Sending this out to get early feedback - I will roll it out to the other controllers once we agree on how we want to do logging going forward for elafros controllers.

For the route controller, each log statement is enriched with the controller type ("route") plus the route's name and namespace. These labels will allow us to filter logs for a specific route, or namespace, or for the entire route controller (of course assuming that we will have a good logging backend - kubectl logs will suck for this).

I can't find a better way of propagating the logger between function calls. Let me know if there is a better way of doing this. If this were context based (requests), I could have added the logger to the context, but we don't operate on a context in controllers as far as I understand.