Skip to content

Move 429 results from Error to Debug#46

Closed
bboreham wants to merge 1 commit intomasterfrom
reduce-logging
Closed

Move 429 results from Error to Debug#46
bboreham wants to merge 1 commit intomasterfrom
reduce-logging

Conversation

@bboreham
Copy link
Collaborator

429 indicates an over-limit condition which will have been logged by the component that detected it, so we don't need to log it again on the calling side.

Example:

{"log":"time=\"2017-07-15T21:26:35Z\" level=warning msg=\"gRPC /cortex.Ingester/Push (rpc error: code = Code(429) desc = per-metric series limit exceeded) 2.412325ms\" \n",[...]{"name":"ingester"}
{"log":"WARN: 2017/07/17 12:22:03.256129 POST /api/prom/push (429) 11.665082ms\n",[...],"container_name":"authfe"}

429 indicates an over-limit condition which will have been logged by the component that detected it.
@rade
Copy link
Member

rade commented Jul 18, 2017

I worry this will make problem investigation harder. Right now, when our monitoring shows a certain error rate for a component - say nHz, the logs for that component will have a corresponding number of error entries (n per second).

So in the example, the error currently shows up in the metrics of the ingester, distributor and authfe. And there will be corresponding error messages in the logs of all three. Whereas with the change, the metrics will be the same, but the error log messages will only show up in the ingester.

We usually investigate errors "front to back": we'd typically first look at the error rate in authfe, look at the logs to figure out what it is, and what, if any, component it originates from, then look at the error rate of that component, then look at its logs, etc, until we get to the bottom.

Only logging errors at the bottom component will break this method of investigation.

i := &interceptor{ResponseWriter: w, statusCode: http.StatusOK}
next.ServeHTTP(i, r)
if 100 <= i.statusCode && i.statusCode < 400 {
if 100 <= i.statusCode && (i.statusCode < 400 || i.statusCode < 429) {

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

@jml
Copy link
Contributor

jml commented Jul 18, 2017

@jml
Copy link
Contributor

jml commented Jul 18, 2017

@rade A valid concern. However, note that you can still employ a front-to-back approach by looking at the metrics, chasing down which service the 429s come from.

@bboreham
Copy link
Collaborator Author

This PR was obviated by #59 which moved all 400-level errors to debug.

@bboreham bboreham closed this Dec 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants