
Fix segment/unavailable/count metric when replicaCount=0 for load rules#16020

Merged
georgew5656 merged 1 commit into master from fixSegmentUnavailableMetric on Mar 1, 2024
Conversation

@georgew5656
Contributor

When a load rule specifies replicaCount=0, the affected segments are not loaded onto historicals, but they remain queryable via the multi-stage query engine. Currently these segments are labeled as unavailable in segment/unavailable/count, which seems wrong to me since that metric indicates segments that still need to be loaded.

The behavior also differs from segment/underReplicated/count, since that metric does not count replicaCount=0 segments as unavailable.
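For context, a deep-storage-only load rule of the kind described above might look like the following. This is an illustrative sketch; the exact rule shape depends on your Druid version and cluster tiers:

```json
{
  "type": "loadForever",
  "tieredReplicants": {
    "_default_tier": 0
  }
}
```

With zero tiered replicants, segments matching the rule stay in deep storage only and are reachable through the MSQ engine rather than historicals.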

Description

When deciding whether to label a segment as unavailable for the purposes of the segment/unavailable/count metric, check whether the required replica count is 0; if so, the segment should not be marked as unavailable.

The segment/underReplicated/count metric already does this.

Fixed the bug ...

Renamed the class ...

Added a forbidden-apis entry ...

This is a small one-line change to exclude non-historical segments from being marked as unavailable. I tested this on my local machine with the logging emitter and a test datasource.
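The changed condition can be sketched in isolation. The following is a minimal standalone model, not the actual Druid classes; SegmentReplicaCountSketch here is a simplified stand-in with just the two fields the check needs:

```java
// Minimal standalone sketch of the changed condition; the class and
// method names mirror the PR but are simplified, not actual Druid code.
class SegmentReplicaCountSketch
{
  final int totalLoaded;  // replicas currently loaded on historicals
  final int required;     // replicas required by the matching load rules

  SegmentReplicaCountSketch(int totalLoaded, int required)
  {
    this.totalLoaded = totalLoaded;
    this.required = required;
  }

  // Before the fix: a segment counted as available only if at least one
  // replica was loaded, so replicaCount=0 segments looked unavailable.
  boolean isAvailableBefore()
  {
    return totalLoaded > 0;
  }

  // After the fix: a segment whose load rules require zero replicas
  // (deep-storage-only, queryable via MSQ) is not marked unavailable.
  boolean isAvailableAfter()
  {
    return totalLoaded > 0 || required == 0;
  }
}
```

Under this model, a segment with totalLoaded=0 and required=0 flips from unavailable to available, while segments with required > 0 behave exactly as before.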

Release note

  • Clarified the behavior of segment/unavailable/count with respect to the query-from-deep-storage feature
Key changed/added classes in this PR
  • DruidCoordinator

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Contributor

@suneet-s left a comment


Very nice! 👍

Given that the test has a 60s timeout - could we add these changes to one of the existing tests to test this instead of adding a new test and extending the dead time that the unit tests have to run for? Maybe in #testCoordinatorTieredRun?

@georgew5656
Contributor Author

Very nice! 👍

Given that the test has a 60s timeout - could we add these changes to one of the existing tests to test this instead of adding a new test and extending the dead time that the unit tests have to run for? Maybe in #testCoordinatorTieredRun?

Is that timeout a maximum or a static config? When I ran it locally the test went pretty fast, around 5 seconds.

I did consider moving it into one of the existing tests, but I thought what I was testing was different enough that it deserved its own test (no segment assignment to historicals, etc.).

@suneet-s
Contributor

suneet-s commented Mar 1, 2024

Is that timeout a maximum or a static config? When I ran it locally the test went pretty fast, around 5 seconds.

I did consider moving it into one of the existing tests, but I thought what I was testing was different enough that it deserved its own test (no segment assignment to historicals, etc.).

I tried running the test as part of DruidCoordinatorTest and it actually only takes a few hundred millis when run as part of the test suite as opposed to on its own, so a separate test is better. Thanks @georgew5656!

@georgew5656 georgew5656 merged commit ef48ace into master Mar 1, 2024
@kfaraz kfaraz deleted the fixSegmentUnavailableMetric branch March 2, 2024 02:24
```diff
 for (DataSegment segment : dataSegments) {
   SegmentReplicaCount replicaCount = segmentReplicationStatus.getReplicaCountsInCluster(segment.getId());
-  if (replicaCount != null && replicaCount.totalLoaded() > 0) {
+  if (replicaCount != null && (replicaCount.totalLoaded() > 0 || replicaCount.required() == 0)) {
```
Contributor

@cryptoe cryptoe Mar 5, 2024


Should we also document this change here: /docs/operations/metrics.md#L333? Thinking more about how the doc would look, should we add a new metric for the replicaCount.required == 0 case?
The reason I am saying this is that users on the regular SQL/native endpoints will never see data from used segments that live only in deep storage (i.e. required == 0). They might get thrown off because of this.

I feel the current doc change for the metric fails to capture this nuance, and if we try to explain the nuance, it becomes complicated.
A new metric, segment/deepStorageOnly/count, would probably make more sense, no?
Wdyt?

Contributor


Yeah, I did have similar thoughts.

Following the approach used in this PR, no segment would ever truly be considered unavailable since it can always be queried using MSQ-query-from-deep-storage.

The source of the confusion is the definition of unavailable/count itself.
Does unavailable denote a segment

  1. that should be loaded but is not?
    OR
  2. that should be queryable but is not?

The metric underReplicated/count corresponds with option 1 and this PR makes unavailable/count use option 1 too.

The docs currently include both flavors in the definition of this metric (which is correct for the native SQL engine):

Number of unique segments left to load until all used segments are available for queries.

@georgew5656 , to eliminate the confusion, as @cryptoe suggests, I think we should:

  • keep this metric unchanged (in case there are downstream apps relying on its values being reported correctly for the native engine)
  • add a new metric for MSQ cases
  • add the new info to the docs
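If the separate-metric route were taken, the coordinator's counting loop might split the two cases along these lines. This is purely illustrative: segment/deepStorageOnly/count and the names below are hypothetical, not existing Druid code.

```java
import java.util.List;

// Illustrative sketch of splitting the counts into two metrics, as
// suggested above; the class name and metric split are hypothetical.
class MetricSplitSketch
{
  // Each entry is {totalLoaded, required}. Returns {unavailable, deepStorageOnly}:
  // required == 0 would feed a hypothetical segment/deepStorageOnly/count,
  // while required > 0 with nothing loaded feeds segment/unavailable/count.
  static int[] countSegments(List<int[]> counts)
  {
    int unavailable = 0;
    int deepStorageOnly = 0;
    for (int[] c : counts) {
      if (c[1] == 0) {
        deepStorageOnly++;   // queryable only from deep storage (MSQ)
      } else if (c[0] == 0) {
        unavailable++;       // should be loaded on historicals but is not
      }
    }
    return new int[]{unavailable, deepStorageOnly};
  }
}
```

This keeps the existing meaning of segment/unavailable/count for native-engine users while still surfacing the deep-storage-only population separately.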

Contributor Author


I don't think this is changing the existing behavior, AFAIK. Before this change, if a segment was not loaded via a load rule it would not be included in segment/unavailable/count; now the behavior is the same.

The difference is that a segment that is not loaded onto a historical (hot rule) but is covered by a cold rule is no longer counted in segment/unavailable/count.

@georgew5656 georgew5656 mentioned this pull request Mar 7, 2024
10 tasks
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024