Skip to content

Add a new metric query/segments/count that is not emitted by default#11394

Merged
jihoonson merged 6 commits intoapache:masterfrom
capistrant:implement-skeleton-for-new-query-metric
Jul 23, 2021
Merged

Add a new metric query/segments/count that is not emitted by default#11394
jihoonson merged 6 commits intoapache:masterfrom
capistrant:implement-skeleton-for-new-query-metric

Conversation

@capistrant
Copy link
Copy Markdown
Contributor

@capistrant capistrant commented Jun 29, 2021

Start Release Notes
Add the plumbing for a new query metric called query/segments/count. This metric is emitted by the broker and contains the value of the number of segments that the query is going to hit. This metric is not enabled by default. To enable the metric, you need to extend QueryMetrics and implement QueryMetrics<QueryType> reportQueriedSegmentCount(long segmentCount) See the QueryMetrics javadoc for more info on how to enable this metric: javadoc
End Release Notes

Description

Add the plumbing for a new query metric, query/segments/count that is not emitted by default. The value of this metric is the number of segments that a query will touch. My team has found it as an easy way to do analytics on the reach of queries running on our multi-tenant clusters. You could get this information via query metrics that are related to individual segments, but this is a more concise way to retrieve the information. My team has clusters with millions of segments and have made the decision to forgo segment level metric emission due to the cost. I assume other large clusters would feel the same about having these metrics without having the segment level metrics in their feeds. This new metric was especially helpful as we worked on identifying what we should use as a segment threshold for priority reduction on our clusters.

An open question is how to document non-emitted metrics so people know the plumbing is there if they want to emit them.

The metric is disabled by default. I made this decision because there are discussions in the past that advocate for not adding default metrics to the code base. One of the main arguments for this strategy is that new metrics increase volume and increased volume will increase costs incurred by cluster operators. This can be very real dollar figures when it comes to cloud deployments paying for message queues and druid storage/compute. Therefore, enabling the metric requires either patching the code in DefaultQueryMetrics or Overriding QueryMetrics in their own fork.

Here is the discussion regarding query metric emission by default: link


Key changed/added classes in this PR
  • QueryMetrics
  • CachingClusteredClient

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@abhishekagarwal87
Copy link
Copy Markdown
Contributor

FWIW, I think it's a useful enough metric to be added to the default set. It will let you see the spread of queries in terms of #segments. If it's not enabled by default, how is the metric going to be enabled? I glanced through the linked discussion and it seems configuration is the recommended option. is that the plan or do you already have a custom extension for metrics?

@capistrant
Copy link
Copy Markdown
Contributor Author

FWIW, I think it's a useful enough metric to be added to the default set. It will let you see the spread of queries in terms of #segments. If it's not enabled by default, how is the metric going to be enabled? I glanced through the linked discussion and it seems configuration is the recommended option. is that the plan or do you already have a custom extension for metrics?

As of this PR, the operator would either need to edit the code and build from source or add an extension. Neither of which is a user friendly approach. But going the configuration route would be a major undertaking since important design decisions will need to be made and reviewed to make sure it is setup in a way that makes sense. Other metrics such as, https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java#L262, are setup like this PR is right now

query = scheduler.prioritizeAndLaneQuery(queryPlus, segmentServers);
queryPlus = queryPlus.withQuery(query);
queryPlus = queryPlus.withQueryMetrics(toolChest);
queryPlus.getQueryMetrics().reportQueriedSegmentCount(segmentServers.size()).emit(emitter);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RetryQueryRunner is responsible for re-routing the query to new homes of segments if they are moved during query processing. It uses CachingClusteredClient for re-routing. As a result, this can report the same segments multiple times if those segments are moved. I think this is OK and worth documenting.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jihoonson That retry code is unfamiliar to me at this time, Number of segments that will be touched by the query. If the query has to be retried, the metric will be reported for all retries as well as the original query. Is that a sane description of the metric in the metrics.md file?

Comment thread docs/operations/metrics.md Outdated
|`query/interrupted/count`|number of queries interrupted due to cancellation.|This metric is only available if the QueryCountStatsMonitor module is included.||
|`query/timeout/count`|number of timed out queries.|This metric is only available if the QueryCountStatsMonitor module is included.||
|`query/segments/count`|This query is not enabled by default. See the `QueryMetrics` Interface for reference regarding enabling this metric. Number of segments that will be touched by the query.|Varies.|
|`query/segments/count`|This query is not enabled by default. See the `QueryMetrics` Interface for reference regarding enabling this metric. Number of segments that will be touched by the query. If the query has to be retried, the metric will be reported for all retries as well as the original query.|Varies.|
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to add some more details. Please feel free to edit it if you like.

In the broker, it makes a plan to distribute the query to realtime tasks and historicals based on a snapshot of segment distribution state. If there are some segments moved after this snapshot is created, certain historicals and realtime tasks can report those segments as missing to the broker. The broker will re-send the query to the new servers that serve those segments after move. In this case, those segments can be counted more than once in this metric.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that wording

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@capistrant thanks for updating the PR. LGTM.

@jihoonson jihoonson merged commit 9767b42 into apache:master Jul 23, 2021
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants