Skip to content

Tidy up lifecycle, query, and ingestion logging.#8889

Merged
fjy merged 5 commits intoapache:masterfrom
gianm:operator-logging
Nov 19, 2019
Merged

Tidy up lifecycle, query, and ingestion logging.#8889
fjy merged 5 commits intoapache:masterfrom
gianm:operator-logging

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Nov 17, 2019

The goal of this patch is to improve the clarity and usefulness of
Druid's logging for cluster operators. For more information, see
https://twitter.com/cowtowncoder/status/1195469299814555648.

Concretely, this patch does the following:

  • Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the
    goal of reducing redundancy and improving clarity by avoiding
    showing rarely-useful log messages. This includes most "starting"
    and "stopping" messages, and most messages related to individual
    columns.
  • Adds new log4j2 templates that show operators how to enabled DEBUG
    logging for certain important packages. These templates can be found in
    the _common configuration folder in the example clusters under conf/druid.
  • Eliminate stack traces for query errors, unless log level is DEBUG
    or more. This is useful because query errors often indicate user
    error rather than system error, but dumping stack trace often gave
    operators the impression that there was a system failure.
  • Adds task id to Appenderator, AppenderatorDriver thread names. In
    the default log4j2 configuration, this will put them in log lines
    as well. It's very useful if a user is using the Indexer, where
    multiple tasks run in the same JVM.
  • More consistent terminology when it comes to "sequences" (sets of
    segments that are handed-off together by Kafka ingestion) and
    "offsets" (cursors in partitions). These terms had been confused in
    some log messages due to the fact that Kinesis calls offsets
    "sequence numbers".
  • Replaces some ugly toString calls with either the JSONification or
    something more operator-accessible (like a URL or segment identifier,
    instead of JSON object representing the same).

The goal of this patch is to improve the clarity and usefulness of
Druid's logging for cluster operators. For more information, see
https://twitter.com/cowtowncoder/status/1195469299814555648.

Concretely, this patch does the following:

- Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the
  goal of reducing redundancy and improving clarity by avoiding
  showing rarely-useful log messages. This includes most "starting"
  and "stopping" messages, and most messages related to individual
  columns.
- Adds new log4j2 templates that show operators how to enabled DEBUG
  logging for certain important packages.
- Eliminate stack traces for query errors, unless log level is DEBUG
  or more. This is useful because query errors often indicate user
  error rather than system error, but dumping stack trace often gave
  operators the impression that there was a system failure.
- Adds task id to Appenderator, AppenderatorDriver thread names. In
  the default log4j2 configuration, this will put them in log lines
  as well. It's very useful if a user is using the Indexer, where
  multiple tasks run in the same JVM.
- More consistent terminology when it comes to "sequences" (sets of
  segments that are handed-off together by Kafka ingestion) and
  "offsets" (cursors in partitions). These terms had been confused in
  some log messages due to the fact that Kinesis calls offsets
  "sequence numbers".
- Replaces some ugly toString calls with either the JSONification or
  something more operator-accessible (like a URL or segment identifier,
  instead of JSON object representing the same).
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Nov 17, 2019

The principles I'm trying to follow in this patch are:

  • Logs at INFO / WARN / ERROR level should be designed for cluster operators.
  • Logs at TRACE / DEBUG level should be designed for Druid developers.
  • Less is more. Having too many logs means the useful messages will get lost in a sea of irrelevant messages, because there will be too many to read and operators won't know what to search for.
  • Avoid logging "business as usual" things (starting, stopping, initializing, refreshing, found nothing to do, etc).
  • Ideally log one message per important lifecycle event (i.e. one log message per segment push to deep storage, one per publish, etc). Not zero and not more than one.

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Nov 17, 2019

+3000

This is super useful

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Nov 19, 2019

This pull request introduces 2 alerts when merging b2d3d41 into f139903 - view on LGTM.com

new alerts:

  • 2 for Unused format argument

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Nov 19, 2019

Pushed updates to address some issues found by CI.

@fjy fjy merged commit c44452f into apache:master Nov 19, 2019
@gianm gianm deleted the operator-logging branch November 19, 2019 21:58
@gianm gianm added this to the 0.17.0 milestone Nov 19, 2019
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 26, 2019
* Tidy up lifecycle, query, and ingestion logging.

The goal of this patch is to improve the clarity and usefulness of
Druid's logging for cluster operators. For more information, see
https://twitter.com/cowtowncoder/status/1195469299814555648.

Concretely, this patch does the following:

- Changes a lot of INFO logs to DEBUG, and DEBUG to TRACE, with the
  goal of reducing redundancy and improving clarity by avoiding
  showing rarely-useful log messages. This includes most "starting"
  and "stopping" messages, and most messages related to individual
  columns.
- Adds new log4j2 templates that show operators how to enabled DEBUG
  logging for certain important packages.
- Eliminate stack traces for query errors, unless log level is DEBUG
  or more. This is useful because query errors often indicate user
  error rather than system error, but dumping stack trace often gave
  operators the impression that there was a system failure.
- Adds task id to Appenderator, AppenderatorDriver thread names. In
  the default log4j2 configuration, this will put them in log lines
  as well. It's very useful if a user is using the Indexer, where
  multiple tasks run in the same JVM.
- More consistent terminology when it comes to "sequences" (sets of
  segments that are handed-off together by Kafka ingestion) and
  "offsets" (cursors in partitions). These terms had been confused in
  some log messages due to the fact that Kinesis calls offsets
  "sequence numbers".
- Replaces some ugly toString calls with either the JSONification or
  something more operator-accessible (like a URL or segment identifier,
  instead of JSON object representing the same).

* Adjustments.

* Adjust integration test.
@jon-wei jon-wei mentioned this pull request Dec 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants