Structured logging for route controller#888

Merged
google-prow-robot merged 19 commits into knative:master from mdemirhan:zap
May 22, 2018

Conversation

@mdemirhan
Contributor

This changes the route controller to use a structured logging library (zap). Sending this to get early feedback; I will roll this out to the other controllers once we agree on how we want to do logging going forward for elafros controllers.

For route controller, what I do is as follows:

  • At the root level (the run loop of the controller), the key "elafros.dev/controller" with value "route" is added to all log entries (see route.go line 139).
  • At the individual work item level (i.e. updating a specific route), "elafros.dev/route" and "elafros.dev/namespace" are added (see line 339).

These labels will allow us to filter logs for a specific route, a namespace, or the entire route controller (assuming, of course, that we have a good logging backend; kubectl logs will suck for this).

I can't find a better way of propagating the logger between function calls. Let me know if there is a better way of doing this. If this was context based (requests), I could have added the logger to the context, but we don't operate on a context in controllers as far as I understand.

@mdemirhan mdemirhan requested a review from tcnghia as a code owner May 14, 2018 20:02
@mdemirhan mdemirhan requested a review from a team May 14, 2018 20:02
@google-prow-robot google-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 14, 2018
…onfiguration in case of bad configurations.
@mdemirhan
Contributor Author

/retest

Comment thread cmd/ela-controller/main.go Outdated
if err := json.Unmarshal([]byte(zapJSON), &loggingCfg); err != nil {
// Failed to parse the logging configuration. Fall back to production config
loggingCfg = zap.NewProductionConfig()
logInitFailure = fmt.Sprintf("Error parsing logging configuration. Using default config: %v", err)
Contributor

This is a total nit, but the way I'd read this log message, the default config would be printed after "Using default config: ", yet we print the error there. I can't think off the top of my head how to make it more readable, so unless somebody has a suggestion, it's fine.

Contributor Author

I will try to clarify this. Thanks for pointing this out!

Comment thread cmd/ela-controller/main.go Outdated
logger.Infof("Using fluentd sidecar image: %s", loggingFluentSidecarImage)
} else {
glog.Fatal("missing required flag: -fluentdSidecarImage")
logger.Fatalf("missing required flag: -fluentdSidecarImage")
Contributor

just Fatal?

Contributor Author

Will fix.

Comment thread pkg/controller/route/route.go Outdated
}

func (c *Controller) reconcileIngress(route *v1alpha1.Route) error {
func (c *Controller) reconcileIngress(route *v1alpha1.Route, logger *zap.SugaredLogger) error {
Member

In some CF codebases there's a pattern used for tracing/debugging that looks like:

func (c *Controller) reconcileIngress(logger *zap.SugaredLogger, ...) error {
  logger = logger.Named("reconcileIngress") // this hints at the call path
  logger.Debug("start")
  defer logger.Debug("end")
...
}

If loggers are named properly you'll more or less have a call path, which is especially useful if you don't have file/line numbers. The debug log statements help detect long blocking calls or deadlocks without needing to instrument signal handling to dump the list of goroutines.

Contributor Author

I like this pattern quite a bit. We have a best practices document and I will add this there. I am not an expert in the controller codebase, so I would be hesitant to add my own instrumentation here; I would rather have the code owners make such changes.

Comment thread pkg/controller/route/route.go Outdated
func (c *Controller) syncTrafficTargetsAndUpdateRouteStatus(route *v1alpha1.Route) (*v1alpha1.Route, error) {
c.consolidateTrafficTargets(route)
configMap, revMap, err := c.getDirectTrafficTargets(route)
func (c *Controller) syncTrafficTargetsAndUpdateRouteStatus(route *v1alpha1.Route, logger *zap.SugaredLogger) (*v1alpha1.Route, error) {
Member

Thoughts on having the logger always be the first argument?

Contributor Author

@grantr and @jonjohnsonjr WDYT? You are the main owners of these files, and if it doesn't bother anyone, I can move logger to be the first argument.

Contributor

I agree with @dprotaso - I think I'd prefer this to be somehow embedded in a context.Context and made the first argument. It seems like we'll want to add context to everything, anyway, and this fits that pattern pretty well...

Contributor

If we're adding arguments everywhere, I'd also prefer to use a context.Context, but this makes me wonder: Istio already uses zap; how does it deal with log context?

Contributor Author

@dprotaso @jonjohnsonjr @grantr - I converted from passing logger around to passing context around as much as possible. Please take a look and let me know how this one looks.

Comment thread pkg/controller/route/route.go Outdated
logger.Info("Creating/Updating placeholder k8s services")

if err = c.reconcilePlaceholderService(route); err != nil {
if err = c.reconcilePlaceholderService(route, logger); err != nil {
Member

Curious whether there are any external calls that would benefit from a context.Context with a timeout. If so, passing around a context with the logger nested in it as a value would change these function definitions.

See my discussion here: #347 (comment)

Contributor Author

There are external calls to k8s APIs, but as far as I know they don't take a context as input.

Comment thread cmd/ela-controller/main.go Outdated
func main() {
flag.Parse()

zapJSON := loggingZapCfg.Get()
Contributor

This is a lot of boilerplate -- is it possible to move it into pkg/logging?

Contributor Author

Will do.

Comment thread pkg/controller/route/route.go Outdated

// We will use sugared logger over the raw logger because it provides a
// better balance between performance and ease of use.
// For performance critical parts, one can call the "Desugar" method and
Contributor

A bit unrelated, but I'm skeptical that we have any performance critical parts that would be materially affected by logging perf.

Contributor Author

I agree. It can be useful in the data plane, but I don't expect any of our control plane code to need this. That being said, I wanted to add the comment just in case we need it in the future for some reason.

Contributor

Performance critical situations:

  • Scaling up or down. Any controller in the scale path must be low latency.
  • High revision cardinality. When informers have thousands of revisions to keep track of, memory pressure and GC pauses may be more of an issue.

The sugared logger is probably fine for other situations. This isn't intended to be a recommendation for this PR, but I prefer the unsugared logging interface based on personal experience with Golang structured logging:

  • It preserves type-safety at the expense of slightly greater verbosity. Several logging bugs were caught by the compiler that would have been runtime bugs otherwise.
  • Debuggability suffers when developers are afraid to log for fear of impacting performance. The implementation of leveled logging in particular should be as close to zero cost as possible for logs below the current level.

Comment thread pkg/controller/route/route.go Outdated

// Start the informer factories to begin populating the informer caches
glog.Info("Starting Route controller")
c.logger.Info("Starting Route controller")
Contributor

It seems strange to access the logger like this in some places and via the context in others. I think ideally we would pass ctx through everything, but IDK if it's worth worrying about for this PR.

Contributor Author

We cannot change the signatures of methods called by k8s (this method plus addConfigurationEvent, updateConfigurationEvent, and updateIngressEvent). That is why these places start with the controller-wide logger and then rebuild their logger with the proper route name and namespace. A bit unfortunate, but I couldn't come up with anything that looks better.

Contributor

Ah, I see. My mistake.

Comment thread pkg/controller/route/route.go Outdated
}

logger := loggerWithRouteInfo(c.logger, namespace, name)
ctx := logging.WithLogger(context.Background(), logger)
Contributor

Perhaps context.TODO() instead? (in most places)

Contributor Author

Will do.

@jonjohnsonjr
Contributor

kubectl logs will suck for this

What does this mean - is the output really noisy?

@mdemirhan
Contributor Author

@jonjohnsonjr The output is going to be as noisy as it is today - no change there. We need to clean up a lot of our logging, which seems either useless or too verbose :) The reason I said kubectl will suck for this is the shape of the output: by default, it will encode the logs as JSON - not great for human consumption. For development, one can change the config and set the encoding to "console" rather than "json" to get more readable output, but key-value pairs will still be encoded as JSON.
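For reference, the development-friendly variant described here would be a zap config along these lines (a sketch; the keys mirror the elaconfig.yaml excerpt discussed later in this thread, with only the encoding changed):

```json
{
  "level": "info",
  "encoding": "console",
  "outputPaths": ["stdout"],
  "errorOutputPaths": ["stderr"],
  "encoderConfig": {
    "timeKey": "",
    "messageKey": "msg",
    "levelKey": "level",
    "levelEncoder": "lowercase"
  }
}
```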

@mdemirhan
Contributor Author

@dprotaso @jonjohnsonjr @grantr @vaikas-google Any concerns with this PR that you think are still unresolved? If not, I would like to check this in and start converting the rest of the components to this model.

Contributor

@grantr grantr left a comment


Great work @mdemirhan! This is a nice improvement. I have a few recommendations and one requested change (returning a noop logger instead of nil from FromContext).

I noticed most of the logging calls here still use interpolated messages instead of structured logging. Do you intend to convert those in a later PR?

Context might seem cumbersome now, but I think it will pay dividends in the future. Examples of other ways it could be useful:

  • Set a deadline on a reconcile iteration.
  • Propagate opencensus tags to reconcile methods, allowing things like identifying which routes take longer to reconcile.
  • Reconcile tracing. I don't know if this is possible, but what if we could generate a trace based on the generation or resourceVersion of a resource that shows everything that happened when that resource was updated?

Comment thread config/elaconfig.yaml
"errorOutputPaths": ["stderr"],
"encoding": "json",
"encoderConfig": {
"timeKey": "",
Contributor

Why is timeKey empty? Do we not log timestamps?

Contributor Author

The Kubernetes logging driver already adds a timestamp to logs collected from stdout and stderr. That is why we omit it here, preventing duplicate timestamps in the logs.

Comment thread cmd/ela-controller/main.go Outdated

func main() {
flag.Parse()
logger := logging.NewLogger(loggingZapCfg.Get())
Contributor

Possible enhancement: name this logger with the name of the binary:

logger := logging.NewLogger(loggingZapCfg.Get()).Named("ela-controller")

That gives the user a way to search for all messages generated by this binary (which includes multiple controllers).

Contributor Author

Will do.

Comment thread pkg/controller/route/route.go Outdated

glog.Infof("Route controller Init")
// Enrich the logs with controller type set to route
logger = logger.With(zap.String(logging.LogKeyControllerType, "route"))
Contributor

Possible enhancement: Also add route to the logger name, for example:

logger = logger.Named("route").With(zap.String(logging.LogKeyControllerType, "route"))

Contributor Author

Will do.

Comment thread pkg/controller/route/route.go Outdated
return err
}
glog.Infof("Created service: %q", service.Name)
logging.FromContext(ctx).Infof("Created service %s", service.Name)
Contributor

Why not set logger to a variable in this method, since there are two log calls?

Contributor Author

Didn't want to increase the code size, and I had an arbitrary threshold of three logger calls before assigning a variable :) I will change these.

Comment thread pkg/logging/config.go
logger := rawLogger.Sugar()
if len(logInitFailure) > 0 {
logger.Error(logInitFailure)
}
Contributor

Another way to structure this would be to split the config parsing into a separate function that returns (zap.Config, error). Then there's no need to check for emptiness of the logInitFailure string.

func NewLogger(configJSON string) *zap.SugaredLogger {
  loggingCfg, parseErr := parseConfig(configJSON)
  if parseErr != nil {
    loggingCfg = zap.NewProductionConfig()
  }
  // Build returns (*zap.Logger, error); the default production config
  // should never fail to build.
  logger, err := loggingCfg.Build()
  if err != nil {
    panic(err)
  }
  if parseErr != nil {
    logger.Error("Failed to parse the logging config. Will use default config.", zap.Error(parseErr))
  }
  return logger.Sugar()
}

Contributor Author

Good suggestion. I will fix that.

Comment thread pkg/logging/constants.go Outdated
limitations under the License.
*/

package logging
Contributor

What do you think about moving these to the elafros/logging/logkey package? That gives the constant names a cleaner namespace. I think brevity will become more important as we update the logger calls to emit fields instead of interpolated messages.

// this
c.logger.Error("Error fetching route upon ingress creation",
  zap.Error(err), zap.String(logging.LogKeyNamespace, ns), zap.String(logging.LogKeyRoute, routeName))

// versus this
c.logger.Error("Error fetching route upon ingress creation",
  zap.Error(err), zap.String(logkey.Namespace, ns), zap.String(logkey.Route, routeName))

Contributor Author

Makes sense. I will move to logkey package.

Comment thread pkg/logging/logger.go
"go.uber.org/zap"
)

type loggerKey struct{}
Contributor

Thanks for doing it this way! Using a string as key is a common mistake with context values.

Comment thread pkg/logging/logger.go Outdated
if logger, ok := ctx.Value(loggerKey{}).(*zap.SugaredLogger); ok {
return logger
}
return nil
Contributor

This should return a no-op logger (zap.NewNop().Sugar()). Otherwise, if the logger is missing from the context, callers will panic when they try to call logger methods on nil.

Contributor Author

I was doing this intentionally so that invalid contexts are caught during development. I will change this to what you describe though.

Contributor

That makes sense! What about this compromise: instead of silently returning a noop logger that will lose potentially valuable log messages, this package can create a package-specific fallback logger, named as such, and return that when logger isn't present. That way no log messages are lost due to incorrect contexts, but it's still detectable when that happens (because messages are tagged with logger=fallback).

Contributor Author

Sounds great. I will send an update shortly.

if got := FromContext(ctx); want != got {
t.Errorf("unexpected logger in context. want: %v, got: %v", want, got)
}
}
Contributor

Add a test for a context that doesn't have a logger value.

Contributor Author

Will do.

Contributor Author

Seems like there is already a test covering this. Line 40.

Contributor

Oops, I missed that 😅

@mdemirhan
Contributor Author

@grantr Thanks for the detailed review. Regarding interpolated messages: I was planning to open GitHub issues and assign them to the owners of the code to handle fully structuring the logs. My change only adds the basic structure (controller name, route name, namespace). I don't have the full context on all the code myself and might not make the right call. Also, on fully structured vs interpolated messages: I think we need a balance between the two. Not everything we put in the messages should be a first-class field. I think the code owners will make the best call here.

Regarding sugared vs raw logger: as a compromise, I can add both sugared and raw loggers to the context, change FromContext to SugaredFromContext() and RawFromContext(), and force developers to explicitly pick one. My concern with using the raw logger as the default is that people will put every single interpolated message into a first-class field, which might not be desirable, especially with log indexing systems. WDYT?

@grantr
Contributor

grantr commented May 18, 2018

I was planning to open GitHub issues and assign to owners of the code to handle fully structuring the logs.

👍 That makes sense.

I think the code owner will make the best call here.

We should have guidance on how to choose between them. I propose the following:

If a field can act as a filter key to group messages, it should be a structured log field. If a field's searchability will lead to confusion (for example, if it tends to be parsed incorrectly or is easily confused with a different field) then it should not be a structured log field. If a field's key is dynamic with high cardinality, then it should be interpolated (though in this case, the dynamic key should likely be a value itself).

Elasticsearch works fairly well with indices having thousands of fields, so I wouldn't worry about anything from that perspective, especially since we control all the code generating controller logs.

As a compromise, I can add both sugared and raw loggers to the context

I don't think that's necessary. Specific cases needing the raw logger can call Desugar(). Alternately, a performance-sensitive controller could use a desugared logger in its contexts by default.

@google-prow-robot google-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2018
@grantr
Contributor

grantr commented May 21, 2018

Oops forgot the /lgtm

@grantr
Contributor

grantr commented May 21, 2018

And that it needs to be on its own line
/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2018
@google-prow-robot google-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label May 21, 2018
@grantr
Contributor

grantr commented May 21, 2018

/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2018
@mdemirhan mdemirhan removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2018
@grantr
Contributor

grantr commented May 21, 2018

I don't think I'm a valid approver, but just in case
/approve

@google-prow-robot google-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label May 21, 2018
@mdemirhan
Contributor Author

/assign @mattmoor, @evankanderson

@mattmoor @evankanderson , can one of you please approve? Grant can't approve. Merging the branch removed the LGTM label, so I will need an LGTM label as well.

@mdemirhan
Contributor Author

/retest

@mattmoor
Member

/approve
/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 22, 2018
@google-prow-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: grantr, mattmoor, mdemirhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-prow-robot google-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2018
@google-prow-robot google-prow-robot merged commit 4de6316 into knative:master May 22, 2018
@mdemirhan mdemirhan deleted the zap branch May 22, 2018 15:54