Structured logging for route controller#888

Merged
google-prow-robot merged 19 commits into knative:master from mdemirhan:zap
May 22, 2018

Conversation

@mdemirhan
Contributor

This changes the route controller to use a structured logging library (zap). Sending this to get early feedback; I will roll this out to the other controllers once we agree on how we want to do logging going forward for elafros controllers.

For route controller, what I do is as follows:

  • At the root level (the run loop of the controller), the key "elafros.dev/controller" with value "route" is added to all log entries (see route.go line 139).
  • At the individual work item level (i.e. updating a specific route), "elafros.dev/route" and "elafros.dev/namespace" are added (see line 339).

These labels will allow us to filter logs for a specific route, a namespace, or the entire route controller (assuming, of course, that we have a good logging backend; kubectl logs will suck for this).

I can't find a better way of propagating the logger between function calls. Let me know if there is a better way of doing this. If this was context based (requests), I could have added the logger to the context, but we don't operate on a context in controllers as far as I understand.

@mdemirhan mdemirhan requested a review from tcnghia as a code owner May 14, 2018 20:02
@mdemirhan mdemirhan requested a review from a team May 14, 2018 20:02
@google-prow-robot google-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 14, 2018
…onfiguration in case of bad configurations.
@mdemirhan
Contributor Author

/retest

Comment thread cmd/ela-controller/main.go Outdated
if err := json.Unmarshal([]byte(zapJSON), &loggingCfg); err != nil {
// Failed to parse the logging configuration. Fall back to production config
loggingCfg = zap.NewProductionConfig()
logInitFailure = fmt.Sprintf("Error parsing logging configuration. Using default config: %v", err)
Contributor

This is a total nit, but the way I'd read this log message, the default config would be printed after "Using default config: ", yet we print the error there. I can't think off the top of my head how to make it more readable, so unless somebody has a suggestion, it's fine.

Contributor Author

I will try to clarify this. Thanks for pointing this out!

Comment thread cmd/ela-controller/main.go Outdated
logger.Infof("Using fluentd sidecar image: %s", loggingFluentSidecarImage)
} else {
glog.Fatal("missing required flag: -fluentdSidecarImage")
logger.Fatalf("missing required flag: -fluentdSidecarImage")
Contributor

just Fatal?

Contributor Author

Will fix.

Comment thread pkg/controller/route/route.go Outdated
}

func (c *Controller) reconcileIngress(route *v1alpha1.Route) error {
func (c *Controller) reconcileIngress(route *v1alpha1.Route, logger *zap.SugaredLogger) error {
Member

In some CF codebases there's a pattern used for tracing/debugging that looks like:

func (c *Controller) reconcileIngress(logger *zap.SugaredLogger, ...) error {
  logger = logger.Named("reconcileIngress") // this hints at the call path
  logger.Debug("start")
  defer logger.Debug("end")
...
}

If loggers are named properly you'll more or less have a call path, which is especially useful if you don't have file/line numbers. The debug log statements help detect long blocking calls or deadlocks without needing to instrument signal handling to dump the list of goroutines.

Contributor Author

I like this pattern quite a bit. We have a best practices document and I will add this there. I am not an expert in the controller codebase, so I would be hesitant to add my own instrumentation here; I would rather have the code owners make such changes.

Comment thread pkg/controller/route/route.go Outdated
func (c *Controller) syncTrafficTargetsAndUpdateRouteStatus(route *v1alpha1.Route) (*v1alpha1.Route, error) {
c.consolidateTrafficTargets(route)
configMap, revMap, err := c.getDirectTrafficTargets(route)
func (c *Controller) syncTrafficTargetsAndUpdateRouteStatus(route *v1alpha1.Route, logger *zap.SugaredLogger) (*v1alpha1.Route, error) {
Member

Thoughts on having the logger always be the first argument?

Contributor Author

@grantr and @jonjohnsonjr WDYT? You are the main owners of these files, and if it doesn't bother anyone, I can move logger to be the first argument.

Contributor

I agree with @dprotaso - I think I'd prefer this to be somehow embedded in a context.Context and made the first argument. It seems like we'll want to add context to everything, anyway, and this fits that pattern pretty well...

Contributor

If we're adding arguments everywhere, I'd also prefer to use a context.Context, but this makes me wonder: Istio already uses zap; how does it deal with log context?

Contributor Author

@dprotaso @jonjohnsonjr @grantr - I converted from passing logger around to passing context around as much as possible. Please take a look and let me know how this one looks.

Comment thread pkg/controller/route/route.go Outdated
logger.Info("Creating/Updating placeholder k8s services")

if err = c.reconcilePlaceholderService(route); err != nil {
if err = c.reconcilePlaceholderService(route, logger); err != nil {
Member

Curious whether there are any external calls that would benefit from a context.Context with a timeout. If so, passing around a context with the logger nested in it as a value would change these function definitions.

See my discussion here: #347 (comment)

Contributor Author

There are external calls to k8s APIs, but as far as I know they don't take a context as input.

Comment thread cmd/ela-controller/main.go Outdated
func main() {
flag.Parse()

zapJSON := loggingZapCfg.Get()
Contributor

This is a lot of boilerplate -- is it possible to move it into pkg/logging?

Contributor Author

Will do.

Comment thread pkg/controller/route/route.go Outdated

// We will use sugared logger over the raw logger because it provides a
// better balance between performance and ease of use.
// For performance critical parts, one can call the "Desugar" method and
Contributor

A bit unrelated, but I'm skeptical that we have any performance critical parts that would be materially affected by logging perf.

Contributor Author

I agree. It can be useful in the data plane, but I don't expect any of our control plane code to need this. That being said, I wanted to add the comment just in case we need it in the future for some reason.

Contributor

Performance critical situations:

  • Scaling up or down. Any controller in the scale path must be low latency.
  • High revision cardinality. When informers have thousands of revisions to keep track of, memory pressure and GC pauses may be more of an issue.

The sugared logger is probably fine for other situations. This isn't intended to be a recommendation for this PR, but I prefer the unsugared logging interface based on personal experience with Golang structured logging:

  • It preserves type-safety at the expense of slightly greater verbosity. Several logging bugs were caught by the compiler that would have been runtime bugs otherwise.
  • Debuggability suffers when developers are afraid to log for fear of impacting performance. The implementation of leveled logging in particular should be as close to zero cost as possible for logs below the current level.

Comment thread pkg/controller/route/route.go Outdated

// Start the informer factories to begin populating the informer caches
glog.Info("Starting Route controller")
c.logger.Info("Starting Route controller")
Contributor

It seems strange to access the logger like this in some places and via the context in others. I think ideally we would pass ctx through everything, but IDK if it's worth worrying about for this PR.

Contributor Author

We cannot change the signatures of methods called by k8s (this method plus addConfigurationEvent, updateConfigurationEvent, and updateIngressEvent). That is why these places start with the controller-wide logger and then rebuild their logger with the proper route name and namespace. A bit unfortunate, but I couldn't come up with anything that looks better.

Contributor

Ah, I see. My mistake.

Comment thread pkg/controller/route/route.go Outdated
}

logger := loggerWithRouteInfo(c.logger, namespace, name)
ctx := logging.WithLogger(context.Background(), logger)
Contributor

Perhaps context.TODO() instead? (in most places)

Contributor Author

Will do.

@jonjohnsonjr
Contributor

kubectl logs will suck for this

What does this mean - is the output really noisy?

@mdemirhan
Contributor Author

@jonjohnsonjr The output is going to be as noisy as it is today - no change there. We need to clean up a lot of our logging, which seems either useless or too verbose :) The reason I said kubectl will suck for this is the shape of the output: by default, it will encode the logs as JSON - not great for human consumption. For development, one can change the config and set the encoding to "console" rather than "json" to get more readable output, but key-value pairs will still be encoded as JSON.
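For reference, the development-friendly variant described here would be a zap config along these lines (a sketch; the keys mirror the elaconfig.yaml excerpt discussed later in this thread, with only the encoding changed):

```json
{
  "level": "info",
  "encoding": "console",
  "outputPaths": ["stdout"],
  "errorOutputPaths": ["stderr"],
  "encoderConfig": {
    "timeKey": "",
    "messageKey": "msg",
    "levelKey": "level",
    "levelEncoder": "lowercase"
  }
}
```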

@mdemirhan
Contributor Author

@dprotaso @jonjohnsonjr @grantr @vaikas-google Any concerns with this PR that you think are still unresolved? If not, I would like to check this in and start converting the rest of the components to this model.

Contributor

@grantr grantr left a comment


Great work @mdemirhan! This is a nice improvement. I have a few recommendations and one requested change (returning a noop logger instead of nil from FromContext).

I noticed most of the logging calls here still use interpolated messages instead of structured logging. Do you intend to convert those in a later PR?

Context might seem cumbersome now, but I think it will pay dividends in the future. Examples of other ways it could be useful:

  • Set a deadline on a reconcile iteration.
  • Propagate opencensus tags to reconcile methods, allowing things like identifying which routes take longer to reconcile.
  • Reconcile tracing. I don't know if this is possible, but what if we could generate a trace based on the generation or resourceVersion of a resource that shows everything that happened when that resource was updated?

Comment thread config/elaconfig.yaml
"errorOutputPaths": ["stderr"],
"encoding": "json",
"encoderConfig": {
"timeKey": "",
Contributor

Why is timeKey empty? Do we not log timestamps?

Contributor Author

The Kubernetes logging driver already adds a timestamp to logs collected from stdout and stderr. That is why we omit it here, preventing duplicate timestamps in the logs.

Comment thread cmd/ela-controller/main.go Outdated

func main() {
flag.Parse()
logger := logging.NewLogger(loggingZapCfg.Get())
Contributor

Possible enhancement: name this logger with the name of the binary:

logger := logging.NewLogger(loggingZapCfg.Get()).Named("ela-controller")

That gives the user a way to search for all messages generated by this binary (which includes multiple controllers).

Contributor Author

Will do.

Comment thread pkg/controller/route/route.go Outdated

glog.Infof("Route controller Init")
// Enrich the logs with controller type set to route
logger = logger.With(zap.String(logging.LogKeyControllerType, "route"))
Contributor

Possible enhancement: Also add route to the logger name, for example:

logger = logger.Named("route").With(zap.String(logging.LogKeyControllerType, "route"))

Contributor Author

Will do.

Comment thread pkg/controller/route/route.go Outdated
return err
}
glog.Infof("Created service: %q", service.Name)
logging.FromContext(ctx).Infof("Created service %s", service.Name)
Contributor

Why not set logger to a variable in this method, since there are two log calls?

Contributor Author

Didn't want to increase the code size, and I had an arbitrary threshold of three logger calls before assigning a variable :) I will change these.

Comment thread pkg/logging/config.go
logger := rawLogger.Sugar()
if len(logInitFailure) > 0 {
logger.Error(logInitFailure)
}
Contributor

Another way to structure this would be to split the config parsing into a separate function that returns (zap.Config, error). Then there's no need to check for emptiness of the logInitFailure string.

func NewLogger(configJSON string) *zap.SugaredLogger {
  loggingCfg, parseErr := parseConfig(configJSON)
  if parseErr != nil {
    loggingCfg = zap.NewProductionConfig()
  }
  // Build returns (*zap.Logger, error); the default production config
  // should never fail to build.
  logger, err := loggingCfg.Build()
  if err != nil {
    panic(err)
  }
  if parseErr != nil {
    logger.Error("Failed to parse the logging config. Will use default config.", zap.Error(parseErr))
  }
  return logger.Sugar()
}

Contributor Author

Good suggestion. I will fix that.

Comment thread pkg/logging/constants.go Outdated
limitations under the License.
*/

package logging
Contributor

What do you think about moving these to the elafros/logging/logkey package? That gives the constant names a cleaner namespace. I think brevity will become more important as we update the logger calls to emit fields instead of interpolated messages.

// this
c.logger.Error("Error fetching route upon ingress creation",
  zap.Error(err), zap.String(logging.LogKeyNamespace, ns), zap.String(logging.LogKeyRoute, routeName))

// versus this
c.logger.Error("Error fetching route upon ingress creation",
  zap.Error(err), zap.String(logkey.Namespace, ns), zap.String(logkey.Route, routeName))

Contributor Author

Makes sense. I will move to logkey package.

Comment thread pkg/logging/logger.go
"go.uber.org/zap"
)

type loggerKey struct{}
Contributor

Thanks for doing it this way! Using a string as key is a common mistake with context values.

Comment thread pkg/logging/logger.go Outdated
if logger, ok := ctx.Value(loggerKey{}).(*zap.SugaredLogger); ok {
return logger
}
return nil
Contributor

This should return a no-op logger (zap.NewNop().Sugar()). Otherwise, if the logger is missing from the context, callers will panic when they try to call logger methods on nil.

Contributor Author

I was doing this intentionally so that invalid contexts are caught during development. I will change this to what you describe though.

Contributor

That makes sense! What about this compromise: instead of silently returning a noop logger that will lose potentially valuable log messages, this package can create a package-specific fallback logger, named as such, and return that when logger isn't present. That way no log messages are lost due to incorrect contexts, but it's still detectable when that happens (because messages are tagged with logger=fallback).

Contributor Author

Sounds great. I will send an update shortly.

if got := FromContext(ctx); want != got {
t.Errorf("unexpected logger in context. want: %v, got: %v", want, got)
}
}
Contributor

Add a test for a context that doesn't have a logger value.

Contributor Author

Will do.

Contributor Author

Seems like there is already a test covering this. Line 40.

Contributor

Oops, I missed that 😅

@mdemirhan
Contributor Author

@grantr Thanks for the detailed review. Regarding interpolated messages: I was planning to open GitHub issues and assign them to the owners of the code to handle fully structuring the logs. My change only adds the basic structure (controller name, route name, namespace). I don't have the full context on all the code myself and might not make the right call. Also, on fully structured vs interpolated messages: I think we need a balance between the two. Not everything we put in the messages should be a first-class field. I think the code owners will make the best call here.

Regarding sugared vs raw logger: as a compromise, I can add both sugared and raw loggers to the context, change FromContext to SugaredFromContext() and RawFromContext(), and force developers to explicitly pick one. My concern with using the raw logger as the default is that people will put every single interpolated message into a first-class field, which might not be desirable, especially with log indexing systems. WDYT?

@grantr
Contributor

grantr commented May 18, 2018

I was planning to open GitHub issues and assign to owners of the code to handle fully structuring the logs.

👍 That makes sense.

I think the code owner will make the best call here.

We should have guidance on how to choose between them. I propose the following:

If a field can act as a filter key to group messages, it should be a structured log field. If a field's searchability will lead to confusion (for example, if it tends to be parsed incorrectly or is easily confused with a different field) then it should not be a structured log field. If a field's key is dynamic with high cardinality, then it should be interpolated (though in this case, the dynamic key should likely be a value itself).

Elasticsearch works fairly well with indices having thousands of fields, so I wouldn't worry about anything from that perspective, especially since we control all the code generating controller logs.

As a compromise, I can add both sugared and raw loggers to the context

I don't think that's necessary. Specific cases needing the raw logger can call Desugar(). Alternately, a performance-sensitive controller could use a desugared logger in its contexts by default.

@google-prow-robot google-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2018
@grantr
Contributor

grantr commented May 21, 2018

Oops forgot the /lgtm

@grantr
Contributor

grantr commented May 21, 2018

And that it needs to be on its own line
/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2018
@google-prow-robot google-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label May 21, 2018
@grantr
Contributor

grantr commented May 21, 2018

/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2018
@mdemirhan mdemirhan removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2018
@grantr
Contributor

grantr commented May 21, 2018

I don't think I'm a valid approver, but just in case
/approve

@google-prow-robot google-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label May 21, 2018
@mdemirhan
Contributor Author

/assign @mattmoor, @evankanderson

@mattmoor @evankanderson , can one of you please approve? Grant can't approve. Merging the branch removed the LGTM label, so I will need an LGTM label as well.

@mdemirhan
Contributor Author

/retest

@mattmoor
Member

/approve
/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 22, 2018
@google-prow-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: grantr, mattmoor, mdemirhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-prow-robot google-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2018
@google-prow-robot google-prow-robot merged commit 4de6316 into knative:master May 22, 2018
@mdemirhan mdemirhan deleted the zap branch May 22, 2018 15:54