Add a new metric query/segments/count that is not emitted by default by capistrant · Pull Request #11394 · apache/druid

capistrant · 2021-06-29T22:12:45Z

Start Release Notes
Add the plumbing for a new query metric called query/segments/count. This metric is emitted by the broker and reports the number of segments that the query is going to hit. This metric is not enabled by default. To enable it, you need to extend QueryMetrics and implement the method QueryMetrics<QueryType> reportQueriedSegmentCount(long segmentCount). See the QueryMetrics javadoc for more info on how to enable this metric: javadoc
End Release Notes

Description

Add the plumbing for a new query metric, query/segments/count, that is not emitted by default. The value of this metric is the number of segments that a query will touch. My team has found it an easy way to do analytics on the reach of queries running on our multi-tenant clusters. You could get this information via query metrics that are related to individual segments, but this is a more concise way to retrieve the information. My team has clusters with millions of segments and has made the decision to forgo segment-level metric emission due to the cost. I assume operators of other large clusters would also want this metric without having the segment-level metrics in their feeds. This new metric was especially helpful as we worked on identifying what we should use as a segment threshold for priority reduction on our clusters.

An open question is how to document non-emitted metrics so people know the plumbing is there if they want to emit them.

The metric is disabled by default. I made this decision because there have been past discussions that advocate for not adding default metrics to the code base. One of the main arguments for this strategy is that new metrics increase volume, and increased volume increases the costs incurred by cluster operators. These can be very real dollar figures for cloud deployments that pay for message queues and Druid storage/compute. Therefore, enabling the metric requires either patching the code in DefaultQueryMetrics or overriding QueryMetrics in the operator's own fork.
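
As a rough sketch of the opt-in pattern described above: the default implementation accepts the segment count but records nothing, and an operator-supplied subclass overrides the method to record it. The types below (SegmentCountMetrics, DefaultSegmentCountMetrics, EmittingSegmentCountMetrics) are simplified stand-ins written for illustration, not Druid's real QueryMetrics or DefaultQueryMetrics. Only the method name reportQueriedSegmentCount(long segmentCount) and the metric name query/segments/count come from this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for Druid's QueryMetrics interface.
interface SegmentCountMetrics {
    SegmentCountMetrics reportQueriedSegmentCount(long segmentCount);

    // Exposes what has been recorded so far (illustration only).
    Map<String, Number> reported();
}

// Stand-in for the default behaviour: the hook exists but records nothing.
class DefaultSegmentCountMetrics implements SegmentCountMetrics {
    protected final Map<String, Number> metrics = new LinkedHashMap<>();

    @Override
    public SegmentCountMetrics reportQueriedSegmentCount(long segmentCount) {
        return this; // not emitted by default
    }

    @Override
    public Map<String, Number> reported() {
        return metrics;
    }
}

// Operator-supplied override that opts in to the metric.
class EmittingSegmentCountMetrics extends DefaultSegmentCountMetrics {
    @Override
    public SegmentCountMetrics reportQueriedSegmentCount(long segmentCount) {
        metrics.put("query/segments/count", segmentCount);
        return this;
    }
}

class SegmentCountSketch {
    public static void main(String[] args) {
        // Default: the call is accepted but nothing is recorded.
        System.out.println(new DefaultSegmentCountMetrics().reportQueriedSegmentCount(42).reported());
        // Override: the segment count is recorded under the metric name.
        System.out.println(new EmittingSegmentCountMetrics().reportQueriedSegmentCount(42).reported());
    }
}
```

In a real deployment the override would extend Druid's own classes and be wired in through a fork or an extension, which this sketch does not attempt to show.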

Here is the discussion regarding query metric emission by default: link

Key changed/added classes in this PR

QueryMetrics
CachingClusteredClient

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

abhishekagarwal87 · 2021-06-30T11:57:55Z

FWIW, I think it's a useful enough metric to be added to the default set. It will let you see the spread of queries in terms of #segments. If it's not enabled by default, how is the metric going to be enabled? I glanced through the linked discussion and it seems configuration is the recommended option. Is that the plan, or do you already have a custom extension for metrics?

capistrant · 2021-07-01T22:24:08Z

As of this PR, the operator would need to either edit the code and build from source or add an extension. Neither is a user-friendly approach. But going the configuration route would be a major undertaking, since important design decisions would need to be made and reviewed to make sure it is set up in a way that makes sense. Other metrics, such as https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java#L262, are set up the way this PR is right now.

jihoonson · 2021-07-08T04:21:23Z

      query = scheduler.prioritizeAndLaneQuery(queryPlus, segmentServers);
      queryPlus = queryPlus.withQuery(query);
+      queryPlus = queryPlus.withQueryMetrics(toolChest);
+      queryPlus.getQueryMetrics().reportQueriedSegmentCount(segmentServers.size()).emit(emitter);


RetryQueryRunner is responsible for re-routing the query to new homes of segments if they are moved during query processing. It uses CachingClusteredClient for re-routing. As a result, this can report the same segments multiple times if those segments are moved. I think this is OK and worth documenting.

@jihoonson That retry code is unfamiliar to me at this time. "Number of segments that will be touched by the query. If the query has to be retried, the metric will be reported for all retries as well as the original query." Is that a sane description of the metric in the metrics.md file?

jihoonson · 2021-07-15T18:25:12Z

 |`query/interrupted/count`|number of queries interrupted due to cancellation.|This metric is only available if the QueryCountStatsMonitor module is included.||
 |`query/timeout/count`|number of timed out queries.|This metric is only available if the QueryCountStatsMonitor module is included.||
-|`query/segments/count`|This query is not enabled by default. See the `QueryMetrics` Interface for reference regarding enabling this metric. Number of segments that will be touched by the query.|Varies.|
+|`query/segments/count`|This query is not enabled by default. See the `QueryMetrics` Interface for reference regarding enabling this metric. Number of segments that will be touched by the query. If the query has to be retried, the metric will be reported for all retries as well as the original query.|Varies.|


I tried to add some more details. Please feel free to edit it if you like.

In the broker, it makes a plan to distribute the query to realtime tasks and historicals based on a snapshot of segment distribution state. If there are some segments moved after this snapshot is created, certain historicals and realtime tasks can report those segments as missing to the broker. The broker will re-send the query to the new servers that serve those segments after move. In this case, those segments can be counted more than once in this metric.

I like that wording
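
The double counting described in this thread can be illustrated with a small simulation. This is a minimal sketch under assumptions, not Druid's RetryQueryRunner or CachingClusteredClient: the class RetryCountSketch and its method are hypothetical, and it assumes only that every routing pass (the original plan and each retry) reports the number of segments it targets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of a broker that reports a segment count per routing pass.
class RetryCountSketch {
    // Returns the values that would be reported: first the size of the
    // original plan, then one entry per retry for the segments that were
    // reported missing and had to be re-routed.
    static List<Integer> reportedCounts(Set<String> plannedSegments,
                                        List<Set<String>> missingPerAttempt) {
        List<Integer> reported = new ArrayList<>();
        reported.add(plannedSegments.size());
        for (Set<String> missing : missingPerAttempt) {
            if (missing.isEmpty()) {
                break; // nothing left to retry
            }
            reported.add(missing.size());
        }
        return reported;
    }

    public static void main(String[] args) {
        Set<String> plan = Set.of("seg-1", "seg-2", "seg-3", "seg-4");
        // seg-3 moved after the snapshot was taken, so it is retried once.
        List<Set<String>> missing = List.of(Set.of("seg-3"));
        List<Integer> counts = reportedCounts(plan, missing);
        int total = counts.stream().mapToInt(Integer::intValue).sum();
        // Four distinct segments, but the reported values sum to five.
        System.out.println(counts + " total=" + total); // [4, 1] total=5
    }
}
```

Anyone aggregating the metric per query should therefore expect the sum to exceed the number of distinct segments whenever retries occur.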

jihoonson

@capistrant thanks for updating the PR. LGTM.

Add a new metric query/segments/count that is not emitted by default

49c51cb

clintropolis added the Area - Metrics/Event Emitting and Area - Operations labels Jul 1, 2021

jihoonson added the Release Notes label Jul 1, 2021

abhishekagarwal87 approved these changes Jul 5, 2021

Lucas.Capistrant added 3 commits July 6, 2021 18:06

docs

ce43dcd

test the default implementation of the metric

08032ab

fix spelling error in docs

f1b7e71

jihoonson reviewed Jul 8, 2021

document the fact that query retries will result in additional metric emissions

90e1e8b

jihoonson reviewed Jul 15, 2021

update using recommended text from @jihoonson

ca242ad

jihoonson approved these changes Jul 21, 2021

jihoonson merged commit 9767b42 into apache:master Jul 23, 2021

clintropolis added this to the 0.22.0 milestone Aug 12, 2021

clintropolis mentioned this pull request Sep 3, 2021

[Draft] 0.22.0 Release Notes #11657

Closed

jihoonson merged 6 commits into apache:master from capistrant:implement-skeleton-for-new-query-metric