
Fix segment/unavailable/count metric when replicaCount=0 for load rules#16020

Merged
georgew5656 merged 1 commit into master from fixSegmentUnavailableMetric on Mar 1, 2024
Conversation

@georgew5656
Contributor

When a load rule specifies replicaCount=0, the affected segments are not loaded onto historicals, but they remain queryable via the multi-stage query engine. Currently these segments are labeled as unavailable in segment/unavailable/count, which seems wrong to me since that metric indicates segments that still need to be loaded.

The behavior also differs from segment/underReplicated/count, since that metric does not count replicaCount=0 segments as unavailable.
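For context, a deep-storage-only load rule of the kind described above might look like the following. This is an illustrative sketch; the exact rule shape depends on your Druid version and cluster tiers:

```json
{
  "type": "loadForever",
  "tieredReplicants": {
    "_default_tier": 0
  }
}
```

With zero tiered replicants, segments matching the rule stay in deep storage only and are reachable through the MSQ engine rather than historicals.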

Description

When deciding whether to label a segment as unavailable for the purposes of the segment/unavailable/count metric, check whether the required replica count is 0; if so, the segment should not be marked as unavailable.

The segment/underReplicated/count metric already does this.

Fixed the bug ...

Renamed the class ...

Added a forbidden-apis entry ...

This is a small one-line change to exclude non-historical segments from being marked as unavailable. I tested this on my local machine with the logging emitter and a test datasource.
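The changed condition can be sketched in isolation. The following is a minimal standalone model, not the actual Druid classes; SegmentReplicaCountSketch here is a simplified stand-in with just the two fields the check needs:

```java
// Minimal standalone sketch of the changed condition; the class and
// method names mirror the PR but are simplified, not actual Druid code.
class SegmentReplicaCountSketch
{
  final int totalLoaded;  // replicas currently loaded on historicals
  final int required;     // replicas required by the matching load rules

  SegmentReplicaCountSketch(int totalLoaded, int required)
  {
    this.totalLoaded = totalLoaded;
    this.required = required;
  }

  // Before the fix: a segment counted as available only if at least one
  // replica was loaded, so replicaCount=0 segments looked unavailable.
  boolean isAvailableBefore()
  {
    return totalLoaded > 0;
  }

  // After the fix: a segment whose load rules require zero replicas
  // (deep-storage-only, queryable via MSQ) is not marked unavailable.
  boolean isAvailableAfter()
  {
    return totalLoaded > 0 || required == 0;
  }
}
```

Under this model, a segment with totalLoaded=0 and required=0 flips from unavailable to available, while segments with required > 0 behave exactly as before.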

Release note

  • Clarified the behavior of segment/unavailable/count with respect to the query-from-deep-storage feature
Key changed/added classes in this PR
  • DruidCoordinator

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Contributor

@suneet-s left a comment


Very nice! 👍

Given that the test has a 60s timeout - could we add these changes to one of the existing tests to test this instead of adding a new test and extending the dead time that the unit tests have to run for? Maybe in #testCoordinatorTieredRun?

@georgew5656
Contributor Author

Very nice! 👍

Given that the test has a 60s timeout - could we add these changes to one of the existing tests to test this instead of adding a new test and extending the dead time that the unit tests have to run for? Maybe in #testCoordinatorTieredRun?

Is that timeout a maximum or a static config? When I ran it locally the test went pretty fast, around 5 seconds.

I did consider moving it into one of the existing tests, but I thought what I was testing was different enough that it deserved its own test (no segment assignment to historicals, etc.).

@suneet-s
Contributor

suneet-s commented Mar 1, 2024

Is that timeout a maximum or a static config? When I ran it locally the test went pretty fast, around 5 seconds.

I did consider moving it into one of the existing tests, but I thought what I was testing was different enough that it deserved its own test (no segment assignment to historicals, etc.).

I tried running the test as part of DruidCoordinatorTest and it actually only takes a few hundred millis when run as part of the test suite as opposed to on its own, so a separate test is better. Thanks @georgew5656!

@georgew5656 georgew5656 merged commit ef48ace into master Mar 1, 2024
@kfaraz kfaraz deleted the fixSegmentUnavailableMetric branch March 2, 2024 02:24
```diff
 for (DataSegment segment : dataSegments) {
   SegmentReplicaCount replicaCount = segmentReplicationStatus.getReplicaCountsInCluster(segment.getId());
-  if (replicaCount != null && replicaCount.totalLoaded() > 0) {
+  if (replicaCount != null && (replicaCount.totalLoaded() > 0 || replicaCount.required() == 0)) {
```
Contributor

@cryptoe cryptoe Mar 5, 2024


Should we also document this change here: /docs/operations/metrics.md#L333? Thinking more about how the doc would look, should we add a new metric for the replicaCount.required == 0 case?
The reason I am saying this is that users on the regular SQL/native endpoints will never see data from used segments that live only in deep storage (i.e. required == 0). They might get thrown off because of this.

I feel the current doc change for the metric fails to capture this nuance, and if we try to explain the nuance, it becomes complicated.
A new metric, segment/deepStorageOnly/count, would probably make more sense, no?
Wdyt?

Contributor


Yeah, I did have similar thoughts.

Following the approach used in this PR, no segment would ever truly be considered unavailable since it can always be queried using MSQ-query-from-deep-storage.

The source of the confusion is the definition of unavailable/count itself.
Does unavailable denote a segment

  1. that should be loaded but is not?
    OR
  2. that should be queryable but is not?

The metric underReplicated/count corresponds with option 1 and this PR makes unavailable/count use option 1 too.

The docs currently include both flavors in the definition of this metric (which is correct for the native SQL engine):

Number of unique segments left to load until all used segments are available for queries.

@georgew5656 , to eliminate the confusion, as @cryptoe suggests, I think we should:

  • keep this metric unchanged (in case there are downstream apps relying on its values being reported correctly for the native engine)
  • add a new metric for MSQ cases
  • add the new info to the docs
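If the separate-metric route were taken, the coordinator's counting loop might split the two cases along these lines. This is purely illustrative: segment/deepStorageOnly/count and the names below are hypothetical, not existing Druid code.

```java
import java.util.List;

// Illustrative sketch of splitting the counts into two metrics, as
// suggested above; the class name and metric split are hypothetical.
class MetricSplitSketch
{
  // Each entry is {totalLoaded, required}. Returns {unavailable, deepStorageOnly}:
  // required == 0 would feed a hypothetical segment/deepStorageOnly/count,
  // while required > 0 with nothing loaded feeds segment/unavailable/count.
  static int[] countSegments(List<int[]> counts)
  {
    int unavailable = 0;
    int deepStorageOnly = 0;
    for (int[] c : counts) {
      if (c[1] == 0) {
        deepStorageOnly++;   // queryable only from deep storage (MSQ)
      } else if (c[0] == 0) {
        unavailable++;       // should be loaded on historicals but is not
      }
    }
    return new int[]{unavailable, deepStorageOnly};
  }
}
```

This keeps the existing meaning of segment/unavailable/count for native-engine users while still surfacing the deep-storage-only population separately.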

Contributor Author


I don't think this is changing the existing behavior, AFAIK. Before this change, if a segment was not loaded via a load rule it would not be included in segment/unavailable/count; now the behavior is the same.

The difference is that a segment that is not loaded onto a historical (hot rule) but is covered by a cold rule is no longer counted in segment/unavailable/count.

@georgew5656 georgew5656 mentioned this pull request Mar 7, 2024
10 tasks
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024