Skip to content

[draft] Start publishing segment load metric from a separate monitor#18314

Closed
uds5501 wants to merge 14 commits intoapache:masterfrom
uds5501:segment_load_metric
Closed

[draft] Start publishing segment load metric from a separate monitor#18314
uds5501 wants to merge 14 commits intoapache:masterfrom
uds5501:segment_load_metric

Conversation

@uds5501
Copy link
Copy Markdown
Contributor

@uds5501 uds5501 commented Jul 23, 2025

Description

Create a SegmentDiscoveryStatsMonitor that takes care of segment discovery on broker for data sources. Adds a small POC in embedded tests as well.


Key changed/added classes in this PR
  • BrokerServerView
  • BrokerSegmentStatsMonitor

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@uds5501 uds5501 force-pushed the segment_load_metric branch from 1d22153 to 6d46baa Compare July 23, 2025 11:11
@uds5501 uds5501 marked this pull request as ready for review July 23, 2025 11:16
Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch, @uds5501 !
I have left some suggestions.

Edit: Please add docs for the new monitor and metric as well.

@uds5501 uds5501 force-pushed the segment_load_metric branch from 6d46baa to f2a174a Compare July 24, 2025 10:28
Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Left some minor suggestions.

Comment thread server/src/test/java/org/apache/druid/client/BrokerServerViewTest.java Outdated
Comment thread server/src/main/java/org/apache/druid/client/BrokerServerView.java Outdated
Comment thread server/src/test/java/org/apache/druid/client/BrokerServerViewTest.java Outdated
@uds5501
Copy link
Copy Markdown
Contributor Author

uds5501 commented Jul 25, 2025

Note - Something weird is happening with test_runIndexTask_forInlineDatasource(). The test when run by itself (until failure) works perfectly, however when ran as part of EmbeddedMariaDBMetadataStoreTest it's failing.

@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Jul 25, 2025

@uds5501 , where does the EmbeddedMariaDBTest fail?
Did this start happening after the change or even before?

@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Jul 25, 2025

@uds5501 , after taking a look at the failure, it seems that the sub-class tests are failing because they override the createCluster method. So, they don't have the required druid.monitoring.monitors property.

Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Only 1 blocking comment to de-duplicate segment counts.

Comment thread server/src/main/java/org/apache/druid/client/BrokerServerView.java Outdated
Comment thread server/src/main/java/org/apache/druid/client/BrokerServerView.java Outdated
Comment thread docs/configuration/index.md
Comment thread docs/operations/metrics.md Outdated
Comment thread docs/operations/metrics.md Outdated
Comment thread docs/operations/metrics.md Outdated
Comment on lines +499 to +501
return RowKey.with(Dimension.DATASOURCE, segment.getDataSource())
.with(Dimension.INTERVAL, String.valueOf(segment.getInterval()))
.and(Dimension.VERSION, segment.getVersion());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are now emitting an added/removed event per segment, rather than interval and version, we can just use the dimensions dataSource, server and description (which contains the full segment ID).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you expecting a full message like {server_name} service added/removed {segment_id} segment in the description?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just the segment id.
server should be a separate dimension on the row key

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +83 to +88
private final ConcurrentHashMap<RowKey, Long> totalSegmentAddCount = new ConcurrentHashMap<>();
private final ConcurrentHashMap<RowKey, Long> totalSegmentRemoveCount = new ConcurrentHashMap<>();
@GuardedBy("totalSegmentAddCount")
private Map<RowKey, Long> previousSegmentAddCount = new HashMap<>();
@GuardedBy("totalSegmentRemoveCount")
private Map<RowKey, Long> previousSegmentRemoveCount = new HashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than this, we can just keep a list of events (added, removed).
When getSegmentsAdded is called, we just flush out that list.
The count for any given row key will always be 1 if we use segment ID and server in the dimensions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I have pushed a non event version to verify embedded tests.
I am thinking if a simple private final ConcurrentHashMap<RowKey, List<DiscoveryEvent>> discoveryEvents will suffice for now, where RowKey has dataSource and server name.

The DiscoveryEvent could be a simple enum with ADDED / REMOVED values. This map will be flushed on each getSegmentDiscoveryEvents() fetch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are adding a discovery event class, that can contain the RowKey itself. Then you don't really need the map. You could just have a list of events, which feels more natural.

Drawback is that on the monitor side, you would need to distinguish added and removed events. So you would need to expose this DiscoveryEvent class, which seems unnecessary.

See if you can just make do with a Boolean, or even keeping separate methods and lists for added/removed is fine.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, yeah I think I can maintain separate added and removed RowKeys for now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@uds5501 uds5501 force-pushed the segment_load_metric branch from b870d54 to e3a48f3 Compare July 28, 2025 11:08
Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left minor suggestions.

event -> event.hasDimension(DruidMetrics.DATASOURCE, dataSource)
event -> event.hasMetricName("serverview/segment/added")
.hasDimension(DruidMetrics.DATASOURCE, dataSource)
.hasValue(1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the value of this metric is always 1, so we can remove this condition.

DUTY_GROUP("dutyGroup"),
DESCRIPTION("description"),
SERVER("server");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change.

.collect(Collectors.toList());
}

private static RowKey getMetricKey(final DataSegment segment, DruidServerMetadata serverMetadata)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please either keep both args final or both non-final.

@uds5501 uds5501 changed the title Start publishing segment load metric from a separate monitor [draft] Start publishing segment load metric from a separate monitor Jul 29, 2025
@uds5501 uds5501 marked this pull request as draft July 29, 2025 08:21
@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Jul 31, 2025

@uds5501 , I think we can close this off for now since the flakiness in the tests has already been addressed.

@uds5501 uds5501 closed this Jul 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants