Add ServiceStatusMonitor to monitor service health#14443
Add ServiceStatusMonitor to monitor service health#14443suneet-s merged 12 commits intoapache:masterfrom
Conversation
|
Thanks for your first contribution to Druid @YongGang ! Instead of introducing one monitor per service, I'd recommend creating a single monitor that can be installed on every service. This will make it easier to use as an operator can set this monitor in the common runtime properties instead of having to configure it per service. Instead of the We do not need to implement all these ideas in this PR, but I think a heartbeat metric will be more flexible than a metric that is scoped to the leader. |
| public boolean doMonitor(ServiceEmitter emitter) { | ||
| final ServiceMetricEvent.Builder builder = new ServiceMetricEvent.Builder(); | ||
|
|
||
| builder.setDimension("serviceType", "overlord"); |
There was a problem hiding this comment.
Since the monitor is using the ServiceEmitter, the service and host will be added auto-magically :) So this is not needed.
|
Thanks for the review @suneet-s . |
a4f4574 to
6345857
Compare
My recommendation would be to remove the |
|
Updated to use |
| |`org.apache.druid.server.metrics.TaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting and also the number of successful/failed tasks per emission period.| | ||
| |`org.apache.druid.server.metrics.TaskSlotCountStatsMonitor`|Reports metrics about task slot usage per emission period.| | ||
| |`org.apache.druid.server.metrics.WorkerTaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting, the number of successful/failed tasks, and metrics about task slot usage for the reporting worker, per emission period. Only supported by middleManager node types.| | ||
| | Name | Description | |
There was a problem hiding this comment.
I think the original formatting aligns better with the rest of the Druid docs.
|
|
||
| ### Overlord/Coordinator | ||
|
|
||
| | Metric | Description | Dimensions | Normal Value | |
There was a problem hiding this comment.
Please use the formatting style used in the rest of the Druid docs.
There was a problem hiding this comment.
Updated. Intellij keep reformatting the doc, interesting.
| return true; | ||
| } | ||
|
|
||
| heartbeatTagsSupplier.get().forEach((k, v) -> { |
There was a problem hiding this comment.
If I am not mistaken, the tags should each be a separate dimension and not be emitted as metric values.
The metric value will always be 1, as it is a simple count.
In the current code, you would be emitting the druid/heartbeat metric multiple times in every invocation of doMonitor.
There was a problem hiding this comment.
Not sure I understand.
For Overlord/Coordinator, this druid/heartbeat metric will be 1 with heartbeatType dimension set to leader. And since this map Map<String, Number> only have one entry, so the metric only reported once per doMonitor call.
For other potential service (or component within the service) to use this monitor, druid/heartbeat metric doesn't have to be 1, that's why heartbeatType dimension is introduced as the metric can have different meaning for different heartbeatType.
There was a problem hiding this comment.
I think what Kashif is mentioning here is that heartbeatTagsSupplier should be a map of dimension keys to values and the metric that is reported for the heartbeat is always a constant.
There was a problem hiding this comment.
Thanks, updated the code with dimensions configured.
| Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric")); | ||
| Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value")); | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
Nit: newline at end of file.
| return true; | ||
| } | ||
|
|
||
| heartbeatTagsSupplier.get().forEach((k, v) -> { |
There was a problem hiding this comment.
Please add a null check on the supplier, just in case someone wants to use this monitor for other services, where the tags supplier is not being injected.
There was a problem hiding this comment.
To make the heartbeat work on other services, I think the pattern in this monitor should be
ServiceMetricEventBuilder builder = ...;
if (heartbeatDimensions is not empty) {
heartbeatDimensions.forEach(builder.setDimension(k, v);
}
emitter.emit(builder.build("druid/heartbeat", 1);
| Supplier<Map<String, Number>> heartbeatTagsSupplier = null; | ||
|
|
||
| @Inject | ||
| public ServiceStatusMonitor() { |
There was a problem hiding this comment.
Is the empty constructor needed?
|
|
||
| ## General Health | ||
|
|
||
| ### Overlord/Coordinator |
There was a problem hiding this comment.
I think it can be used for all services, even though it might not be very useful right now except for coordinator and overlord.
suneet-s
left a comment
There was a problem hiding this comment.
I think we're closer now.
Can you please update the description of the PR to reflect the new design and include examples of what the metrics look like once the feedback is incorporated.
| return true; | ||
| } | ||
|
|
||
| heartbeatTagsSupplier.get().forEach((k, v) -> { |
There was a problem hiding this comment.
I think what Kashif is mentioning here is that heartbeatTagsSupplier should be a map of dimension keys to values and the metric that is reported for the heartbeat is always a constant.
| return true; | ||
| } | ||
|
|
||
| heartbeatTagsSupplier.get().forEach((k, v) -> { |
There was a problem hiding this comment.
To make the heartbeat work on other services, I think the pattern in this monitor should be
ServiceMetricEventBuilder builder = ...;
if (heartbeatDimensions is not empty) {
heartbeatDimensions.forEach(builder.setDimension(k, v);
}
emitter.emit(builder.build("druid/heartbeat", 1);
|
Updated the PR description to reflect the new design. |
suneet-s
left a comment
There was a problem hiding this comment.
Looks good once the docs + tests are updated!
|
|
||
| |Metric|Description|Dimensions|Normal Value| | ||
| |------|-----------|----------|------------| | ||
| |`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1| |
There was a problem hiding this comment.
Stale doc.
| |`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1| | |
| |`druid/heartbeat`| Metric indicating the service is up. `ServiceStatusMonitor` must be enabled. |`leader` on the Overlord and Coordinator.|1| |
| |`org.apache.druid.server.metrics.TaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting and also the number of successful/failed tasks per emission period.| | ||
| |`org.apache.druid.server.metrics.TaskSlotCountStatsMonitor`|Reports metrics about task slot usage per emission period.| | ||
| |`org.apache.druid.server.metrics.WorkerTaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting, the number of successful/failed tasks, and metrics about task slot usage for the reporting worker, per emission period. Only supported by middleManager node types.| | ||
| | `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports service heartbeat. For overlord/coordinator, the number is leader count. Only supported by overlord/coordinator node types.| |
There was a problem hiding this comment.
Stale doc
| | `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports service heartbeat. For overlord/coordinator, the number is leader count. Only supported by overlord/coordinator node types.| | |
| | `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports a heartbeat for the service. | |
| heartbeatTags.put("leader", 1); | ||
| emitter.flush(); | ||
| monitor.doMonitor(emitter); | ||
| Assert.assertEquals(1, emitter.getEvents().size()); | ||
| Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("leader")); | ||
| Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric")); | ||
| Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value")); |
There was a problem hiding this comment.
I don't think this part of the test is needed. Instead, can you please add a test for adding more than 1 dimension to the metric.
| heartbeatTags.put("leader", 1); | |
| emitter.flush(); | |
| monitor.doMonitor(emitter); | |
| Assert.assertEquals(1, emitter.getEvents().size()); | |
| Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("leader")); | |
| Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric")); | |
| Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value")); |
And another test for no dimensions in the heartbeatTagsSupplier
|
|
||
| |Metric|Description|Dimensions|Normal Value| | ||
| |------|-----------|----------|------------| | ||
| |`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1| |
There was a problem hiding this comment.
Nit: I think we should prefix this metric with either cluster/ or server/ so that this metric becomes cluster/heartbeat or server/heartbeat. We can add other relevant metrics later which have the same prefix.
The prefix druid/ doesn't seem to give any info about the metric.
@suneet-s , @YongGang , what do you think?
There was a problem hiding this comment.
I don't have a strong preference. We could also just call the metric heartbeat instead of druid/hearbeat
The Druid process might be running in a container, in which case server can be misleading. Do you have some examples in mind of other metrics that would live under cluster/ that wouldn't make sense under druid/?
There was a problem hiding this comment.
No, I don't have a concrete example in mind either. I was thinking mostly of any cluster-level information, inter-service communication, etc.
I agree with you that server/ can be misleading when running on containers. I avoided using it in a PR due to similar reasons. How about service/ as an alternative? :)
I am working on a PR where I think I am going to use cluster/ for server view syncs. e.g. cluster/serverview/synced which denotes the sync status between coordinator/broker inventory and different historical/peon processes.
druid/ is pretty much a catch-all and any metric that goes under cluster/ could potentially go under druid/ as well. But I generally try to use prefixes that make the metrics a little more user-friendly and somewhat self-explanatory.
There was a problem hiding this comment.
service/ prefix seems fine to me as the heartbeat doesn't have to be on cluster level.
|
The intelliJ inspections failure looks legitimate I have re-triggered the other failing jobs 🤞 |
I removed the Exception declaration (was added by IntelliJ auto gen code). |
d8b277f to
99c1b3b
Compare
…vice leader status
99c1b3b to
3b939c9
Compare
|
I rebased from master again hope to fix the IT (ITNestedQueryPushDownTest) failure. |
|
@YongGang If you run Druid with the |
|
Thanks @suneet-s , now changed to bind HeartbeatSupplier conditionally in Coordinator |
|
Build succeeded! But I don't have write access, please help merge the PR. |
* Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor service leader status * make the monitor more general * resolve conflict * use Supplier pattern to provide metrics * reformat code and doc * move service specific tag to dimension * minor refine * update doc * reformat code * address comments * remove declared exception * bind HeartbeatSupplier conditionally in Coordinator
Description
There are cases that no leader for Overlord/Coordinator and cases that multiple leaders selected, this could happen when service is over loaded or network partition happened.
This PR adds a general heartbeat metric to indicate service health. For Overlord/Coordinator, the sum of
druid/heartbeatmetric withleader=1dimension should always be one in a heathy Druid cluster.The new metric example for Coordinator:
{"feed":"metrics","leader":1,"metric":"druid/heartbeat","service":"coordinator","host":"localhost:8081","version":"","value":1,"timestamp":"2023-06-21T21:55:53.216Z"}For Overlord:
{"feed":"metrics","leader":1,"metric":"druid/heartbeat","service":"overlord","host":"localhost:8090","version":"","value":1,"timestamp":"2023-06-21T21:34:01.764Z"}Other service need to provide following supplier to report
druid/heartbeatmetric:Release note
Add a new Monitor to monitor the health of Overlord and Coordinator service.
Key changed/added classes in this PR
ServiceStatusMonitorThis PR has: