Add a new metric query/segments/count that is not emitted by default#11394
Conversation
|
FWIW, I think it's a useful enough metric to be added to the default set. It will let you see the spread of queries in terms of #segments. If it's not enabled by default, how is the metric going to be enabled? I glanced through the linked discussion and it seems configuration is the recommended option. is that the plan or do you already have a custom extension for metrics? |
As of this PR, the operator would either need to edit the code and build from source or add an extension. Neither of which is a user friendly approach. But going the configuration route would be a major undertaking since important design decisions will need to be made and reviewed to make sure it is setup in a way that makes sense. Other metrics such as, https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java#L262, are setup like this PR is right now |
| query = scheduler.prioritizeAndLaneQuery(queryPlus, segmentServers); | ||
| queryPlus = queryPlus.withQuery(query); | ||
| queryPlus = queryPlus.withQueryMetrics(toolChest); | ||
| queryPlus.getQueryMetrics().reportQueriedSegmentCount(segmentServers.size()).emit(emitter); |
There was a problem hiding this comment.
RetryQueryRunner is responsible for re-routing the query to new homes of segments if they are moved during query processing. It uses CachingClusteredClient for re-routing. As a result, this can report the same segments multiple times if those segments are moved. I think this is OK and worth documenting.
There was a problem hiding this comment.
@jihoonson That retry code is unfamiliar to me at this time, Number of segments that will be touched by the query. If the query has to be retried, the metric will be reported for all retries as well as the original query. Is that a sane description of the metric in the metrics.md file?
| |`query/interrupted/count`|number of queries interrupted due to cancellation.|This metric is only available if the QueryCountStatsMonitor module is included.|| | ||
| |`query/timeout/count`|number of timed out queries.|This metric is only available if the QueryCountStatsMonitor module is included.|| | ||
| |`query/segments/count`|This query is not enabled by default. See the `QueryMetrics` Interface for reference regarding enabling this metric. Number of segments that will be touched by the query.|Varies.| | ||
| |`query/segments/count`|This query is not enabled by default. See the `QueryMetrics` Interface for reference regarding enabling this metric. Number of segments that will be touched by the query. If the query has to be retried, the metric will be reported for all retries as well as the original query.|Varies.| |
There was a problem hiding this comment.
I tried to add some more details. Please feel free to edit it if you like.
In the broker, it makes a plan to distribute the query to realtime tasks and historicals based on a snapshot of segment distribution state. If there are some segments moved after this snapshot is created, certain historicals and realtime tasks can report those segments as missing to the broker. The broker will re-send the query to the new servers that serve those segments after move. In this case, those segments can be counted more than once in this metric.
There was a problem hiding this comment.
I like that wording
jihoonson
left a comment
There was a problem hiding this comment.
@capistrant thanks for updating the PR. LGTM.
Start Release Notes
Add the plumbing for a new query metric called
query/segments/count. This metric is emitted by the broker and contains the value of the number of segments that the query is going to hit. This metric is not enabled by default. To enable the metric, you need to extendQueryMetricsand implementQueryMetrics<QueryType> reportQueriedSegmentCount(long segmentCount)See theQueryMetricsjavadoc for more info on how to enable this metric: javadocEnd Release Notes
Description
Add the plumbing for a new query metric,
query/segments/countthat is not emitted by default. The value of this metric is the number of segments that a query will touch. My team has found it as an easy way to do analytics on the reach of queries running on our multi-tenant clusters. You could get this information via query metrics that are related to individual segments, but this is a more concise way to retrieve the information. My team has clusters with millions of segments and have made the decision to forgo segment level metric emission due to the cost. I assume other large clusters would feel the same about having these metrics without having the segment level metrics in their feeds. This new metric was especially helpful as we worked on identifying what we should use as a segment threshold for priority reduction on our clusters.An open question is how to document non-emitted metrics so people know the plumbing is there if they want to emit them.
The metric is disabled by default. I made this decision because there are discussions in the past that advocate for not adding default metrics to the code base. One of the main arguments for this strategy is that new metrics increase volume and increased volume will increase costs incurred by cluster operators. This can be very real dollar figures when it comes to cloud deployments paying for message queues and druid storage/compute. Therefore, enabling the metric requires either patching the code in DefaultQueryMetrics or Overriding QueryMetrics in their own fork.
Here is the discussion regarding query metric emission by default: link
Key changed/added classes in this PR
QueryMetricsCachingClusteredClientThis PR has: