Skip to content

fix bug with broker parallel merge metrics emitting#13420

Merged
clintropolis merged 1 commit intoapache:masterfrom
clintropolis:like-actually-emit-broker-paralle-merge-metrics
Dec 7, 2022
Merged

fix bug with broker parallel merge metrics emitting#13420
clintropolis merged 1 commit intoapache:masterfrom
clintropolis:like-actually-emit-broker-paralle-merge-metrics

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Nov 23, 2022

Description

Heh, I left out the call to .emit on the broker parallel merge pool metrics so they just get reported into thin air. This PR fixes this so that they now exist and look something like this (when wired up to a QueryMetrics implementation that actually emits them, they are currently off by default):

2022-11-23T04:23:50,761 INFO [sql[e4429600-e50b-4fee-90f6-941c11204924]] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"numComplexMetrics":"0","mergeParallelism":2,"type":"timeseries","version":"","duration":"PT9223372036854775.807S","feed":"metrics","metric":"query/merge/parallelism","hasFilters":"false","service":"druid/broker","host":"localhost:8082","context":{"defaultTimeout":300000,"finalize":false,"grandTotal":false,"maxScatterGatherBytes":9223372036854775807,"populateCache":false,"queryFailTime":1669177730685,"queryId":"e4429600-e50b-4fee-90f6-941c11204924","sqlOuterLimit":1001,"sqlQueryId":"e4429600-e50b-4fee-90f6-941c11204924","useCache":false},"limit":"2147483647","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"id":"e4429600-e50b-4fee-90f6-941c11204924","numMetrics":"1","value":2,"dataSource":"wikipedia_rollup_test","timestamp":"2022-11-23T04:23:50.761Z"}
2022-11-23T04:23:50,761 INFO [sql[e4429600-e50b-4fee-90f6-941c11204924]] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"numComplexMetrics":"0","mergeParallelism":2,"type":"timeseries","version":"","duration":"PT9223372036854775.807S","feed":"metrics","metric":"query/merge/sequences","hasFilters":"false","service":"druid/broker","host":"localhost:8082","context":{"defaultTimeout":300000,"finalize":false,"grandTotal":false,"maxScatterGatherBytes":9223372036854775807,"populateCache":false,"queryFailTime":1669177730685,"queryId":"e4429600-e50b-4fee-90f6-941c11204924","sqlOuterLimit":1001,"sqlQueryId":"e4429600-e50b-4fee-90f6-941c11204924","useCache":false},"limit":"2147483647","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"id":"e4429600-e50b-4fee-90f6-941c11204924","numMetrics":"1","value":4,"dataSource":"wikipedia_rollup_test","timestamp":"2022-11-23T04:23:50.761Z"}
2022-11-23T04:23:50,761 INFO [sql[e4429600-e50b-4fee-90f6-941c11204924]] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"numComplexMetrics":"0","mergeParallelism":2,"type":"timeseries","version":"","duration":"PT9223372036854775.807S","feed":"metrics","metric":"query/merge/rows/input","hasFilters":"false","service":"druid/broker","host":"localhost:8082","context":{"defaultTimeout":300000,"finalize":false,"grandTotal":false,"maxScatterGatherBytes":9223372036854775807,"populateCache":false,"queryFailTime":1669177730685,"queryId":"e4429600-e50b-4fee-90f6-941c11204924","sqlOuterLimit":1001,"sqlQueryId":"e4429600-e50b-4fee-90f6-941c11204924","useCache":false},"limit":"2147483647","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"id":"e4429600-e50b-4fee-90f6-941c11204924","numMetrics":"1","value":4,"dataSource":"wikipedia_rollup_test","timestamp":"2022-11-23T04:23:50.761Z"}
2022-11-23T04:23:50,761 INFO [sql[e4429600-e50b-4fee-90f6-941c11204924]] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"numComplexMetrics":"0","mergeParallelism":2,"type":"timeseries","version":"","duration":"PT9223372036854775.807S","feed":"metrics","metric":"query/merge/rows/output","hasFilters":"false","service":"druid/broker","host":"localhost:8082","context":{"defaultTimeout":300000,"finalize":false,"grandTotal":false,"maxScatterGatherBytes":9223372036854775807,"populateCache":false,"queryFailTime":1669177730685,"queryId":"e4429600-e50b-4fee-90f6-941c11204924","sqlOuterLimit":1001,"sqlQueryId":"e4429600-e50b-4fee-90f6-941c11204924","useCache":false},"limit":"2147483647","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"id":"e4429600-e50b-4fee-90f6-941c11204924","numMetrics":"1","value":4,"dataSource":"wikipedia_rollup_test","timestamp":"2022-11-23T04:23:50.761Z"}
2022-11-23T04:23:50,761 INFO [sql[e4429600-e50b-4fee-90f6-941c11204924]] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"numComplexMetrics":"0","mergeParallelism":2,"type":"timeseries","version":"","duration":"PT9223372036854775.807S","feed":"metrics","metric":"query/merge/tasks/count","hasFilters":"false","service":"druid/broker","host":"localhost:8082","context":{"defaultTimeout":300000,"finalize":false,"grandTotal":false,"maxScatterGatherBytes":9223372036854775807,"populateCache":false,"queryFailTime":1669177730685,"queryId":"e4429600-e50b-4fee-90f6-941c11204924","sqlOuterLimit":1001,"sqlQueryId":"e4429600-e50b-4fee-90f6-941c11204924","useCache":false},"limit":"2147483647","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"id":"e4429600-e50b-4fee-90f6-941c11204924","numMetrics":"1","value":10,"dataSource":"wikipedia_rollup_test","timestamp":"2022-11-23T04:23:50.761Z"}
2022-11-23T04:23:50,761 INFO [sql[e4429600-e50b-4fee-90f6-941c11204924]] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"numComplexMetrics":"0","mergeParallelism":2,"type":"timeseries","version":"","duration":"PT9223372036854775.807S","feed":"metrics","metric":"query/merge/cpu","hasFilters":"false","service":"druid/broker","host":"localhost:8082","context":{"defaultTimeout":300000,"finalize":false,"grandTotal":false,"maxScatterGatherBytes":9223372036854775807,"populateCache":false,"queryFailTime":1669177730685,"queryId":"e4429600-e50b-4fee-90f6-941c11204924","sqlOuterLimit":1001,"sqlQueryId":"e4429600-e50b-4fee-90f6-941c11204924","useCache":false},"limit":"2147483647","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"id":"e4429600-e50b-4fee-90f6-941c11204924","numMetrics":"1","value":0,"dataSource":"wikipedia_rollup_test","timestamp":"2022-11-23T04:23:50.761Z"}
2022-11-23T04:23:50,761 INFO [sql[e4429600-e50b-4fee-90f6-941c11204924]] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"numComplexMetrics":"0","mergeParallelism":2,"type":"timeseries","version":"","duration":"PT9223372036854775.807S","feed":"metrics","metric":"query/merge/time","hasFilters":"false","service":"druid/broker","host":"localhost:8082","context":{"defaultTimeout":300000,"finalize":false,"grandTotal":false,"maxScatterGatherBytes":9223372036854775807,"populateCache":false,"queryFailTime":1669177730685,"queryId":"e4429600-e50b-4fee-90f6-941c11204924","sqlOuterLimit":1001,"sqlQueryId":"e4429600-e50b-4fee-90f6-941c11204924","useCache":false},"limit":"2147483647","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"id":"e4429600-e50b-4fee-90f6-941c11204924","numMetrics":"1","value":65,"dataSource":"wikipedia_rollup_test","timestamp":"2022-11-23T04:23:50.761Z"}
2022-11-23T04:23:50,761 INFO [sql[e4429600-e50b-4fee-90f6-941c11204924]] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"numComplexMetrics":"0","mergeParallelism":2,"type":"timeseries","version":"","duration":"PT9223372036854775.807S","feed":"metrics","metric":"query/merge/partition/slowest","hasFilters":"false","service":"druid/broker","host":"localhost:8082","context":{"defaultTimeout":300000,"finalize":false,"grandTotal":false,"maxScatterGatherBytes":9223372036854775807,"populateCache":false,"queryFailTime":1669177730685,"queryId":"e4429600-e50b-4fee-90f6-941c11204924","sqlOuterLimit":1001,"sqlQueryId":"e4429600-e50b-4fee-90f6-941c11204924","useCache":false},"limit":"2147483647","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"id":"e4429600-e50b-4fee-90f6-941c11204924","numMetrics":"1","value":63,"dataSource":"wikipedia_rollup_test","timestamp":"2022-11-23T04:23:50.761Z"}
2022-11-23T04:23:50,761 INFO [sql[e4429600-e50b-4fee-90f6-941c11204924]] org.apache.druid.java.util.emitter.core.LoggingEmitter - {"numComplexMetrics":"0","mergeParallelism":2,"type":"timeseries","version":"","duration":"PT9223372036854775.807S","feed":"metrics","metric":"query/merge/partition/fastest","hasFilters":"false","service":"druid/broker","host":"localhost:8082","context":{"defaultTimeout":300000,"finalize":false,"grandTotal":false,"maxScatterGatherBytes":9223372036854775807,"populateCache":false,"queryFailTime":1669177730685,"queryId":"e4429600-e50b-4fee-90f6-941c11204924","sqlOuterLimit":1001,"sqlQueryId":"e4429600-e50b-4fee-90f6-941c11204924","useCache":false},"limit":"2147483647","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"id":"e4429600-e50b-4fee-90f6-941c11204924","numMetrics":"1","value":40,"dataSource":"wikipedia_rollup_test","timestamp":"2022-11-23T04:23:50.761Z"}

I've also added three new metrics, one for total 'wall' time of the sequence from construction to being totally consume, and one each for the time spent waiting for the fastest and slowest partitions to initialize (fetch the first batch of results to begin work). The three of these should provide some additional insight into how time was spent by the tasks on the pool to accompany the existing set of metrics (which actually exist now).

Release note

  • Fix an issue which cause broker parallel merge pool metrics to not be emitted.
  • Added new broker query metric for total parallel merge processing 'wall' time
  • Added new broker query metrics for parallel merge pool time spent waiting for results of 'fastest' and 'slowest' partitions.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Copy link
Copy Markdown
Member

@rohangarg rohangarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes Clint! Left some comments, also could you please add some tests for the new metrics you've added?

Comment thread processing/src/main/java/org/apache/druid/query/QueryMetrics.java
@clintropolis
Copy link
Copy Markdown
Member Author

I'm not sure it makes sense to add tests to make the coverage bot happy since these metrics are not emitted by default and tests seem pointless.

Specifically, its complaining about these two files:

Diff coverage statistics:
------------------------------------------------------------------------------
|     lines      |    branches    |   functions    |   path
------------------------------------------------------------------------------
|   0% (0/3)     | 100% (0/0)     |  44% (4/9)     | org/apache/druid/query/DefaultQueryMetrics.java
|   0% (0/3)     | 100% (0/0)     |  50% (2/4)     | org/apache/druid/query/search/DefaultSearchQueryMetrics.java
------------------------------------------------------------------------------

the first of which just has no-ops which do nothing, and the latter just delegate to the underlying metrics.

Copy link
Copy Markdown
Member

@rohangarg rohangarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on the coverage check failing just because of technicality

LGTM once the second phase passes. Thanks for initiating it

@clintropolis clintropolis merged commit 37d8833 into apache:master Dec 7, 2022
@clintropolis clintropolis deleted the like-actually-emit-broker-paralle-merge-metrics branch December 7, 2022 01:51
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
@cryptoe
Copy link
Copy Markdown
Contributor

cryptoe commented Sep 26, 2023

@clintropolis I guess these metrics are enabled by default if enableParallelMerge is set.
I did not understand this statement .

they are currently off by default

@cryptoe
Copy link
Copy Markdown
Contributor

cryptoe commented Sep 26, 2023

Okay for future readers, we did not enable the metrics by default:

@Override
public void parallelMergeParallelism(final int parallelism)
{
// Emit nothing by default.
}
since this generates a ton of metrics per query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants