Fix segment/unavailable/count metric when replicaCount=0 for load rules#16020
Fix segment/unavailable/count metric when replicaCount=0 for load rules#16020georgew5656 merged 1 commit intomasterfrom
Conversation
suneet-s
left a comment
There was a problem hiding this comment.
Very nice! 👍
Given that the test has a 60s timeout - could we add these changes to one of the existing tests to test this instead of adding a new test and extending the dead time that the unit tests have to run for? Maybe in #testCoordinatorTieredRun?
is that timeout a maximum or a static config? when i ran it locally the test went pretty fast, like 5 seconds. i did consider moving it into one of the existing test but I thought what I was testing was different enough that it deserved its own test (no segment assignment to historicals, etc) |
I tried running the test as part of the DruidCoordinatorTest and it actually only takes a few hundred millis when run as part of the test suite as opposed to on it's own, so a separate test is better. Thanks @georgew5656 ! |
| for (DataSegment segment : dataSegments) { | ||
| SegmentReplicaCount replicaCount = segmentReplicationStatus.getReplicaCountsInCluster(segment.getId()); | ||
| if (replicaCount != null && replicaCount.totalLoaded() > 0) { | ||
| if (replicaCount != null && (replicaCount.totalLoaded() > 0 || replicaCount.required() == 0)) { |
There was a problem hiding this comment.
Should we also document this change here :
/docs/operations/metrics.md#L333. Thinking more about it on how the doc would look, should we add a new metric type if replicaCount.required==0 ?
The reason I am saying this is that users on the regular sql/native endpoint will never see the data from used segment in deep storage ierequired==0 . They might get thrown off because of this.
I feel the current doc change for the metric fails to capture this nuance.
If we try to explain this nuance, it becomes complicated.
A new metric segment/deepStorageOnly/count would probably make more sense
no ?
Wdyt ?
There was a problem hiding this comment.
Yeah, I did have similar thoughts.
Following the approach used in this PR, no segment would ever truly be considered unavailable since it can always be queried using MSQ-query-from-deep-storage.
The source of the confusion is the definition of unavailable/count itself.
Does unavailable denote a segment
- that should be loaded but is not?
OR - that should be queryable but is not?
The metric underReplicated/count corresponds with option 1 and this PR makes unavailable/count use option 1 too.
The docs currently include both the flavors in the definition of this metric (which is correct for the native SQL engine):
Number of unique segments left to load until all used segments are available for queries.
@georgew5656 , to eliminate the confusion, as @cryptoe suggests, I think we should:
- keep this metric unchanged (in case there are downstream apps relying on its values being reported correctly for the native engine)
- add a new metric for MSQ cases
- update the new info in the docs
There was a problem hiding this comment.
i don't think this is changing the existing behavior afaik. before this change, if a segment was not loaded via a load rule it would not be included in segment/unavailable/count, now the behavior would be the same.
the difference would be that a segment that is not loaded onto a historical (hot rule) but is covered by a cold rule is no longer counted in segment/unavailable/count.
When a load rule specifies replicaCount=0, the affected segments will not be loaded onto historicals but they are queryable via the multi-stage-query engine. Currently these segments will be labeled as unavailable in segment/unavailable/count which seems wrong to me since that metric indicates that segments still need to be loaded.
The behavior is also different from segment/underReplicated/count since that metric does not include replicaCount=0 segments as unavailable.
Description
When checking whether to label a segment as unavailable for the purposes of the segment/unavailable/count metric, check whether the required replicas is 0, in which case the segment should not be marked as unavailable.
The segment/underReplicated/count already does this
Fixed the bug ...
Renamed the class ...
Added a forbidden-apis entry ...
This is a small one-line change to exclude non-historical segments from being marked as unavailable. I tested this on my local machine with the logging emitter and a test datasource.
Release note
Key changed/added classes in this PR
DruidCoordinatorThis PR has: