prometheus metric exporter #10412
Conversation
@@ -0,0 +1,128 @@
{
  "query/time" : { "dimensions" : ["dataSource", "type"], "type" : "timer", "conversionFactor": 1000.0, "help": "Seconds taken to complete a query."},
Are these conversions a good idea?
It would mean that these metrics will be slightly different from how they are described in this documentation: https://druid.apache.org/docs/latest/operations/metrics.html
https://prometheus.io/docs/practices/naming/#base-units
It comes from Prometheus common practice: metrics should use base units, such as seconds.
Okay. I guess there are tradeoffs with either choice. Maybe a good approach is, for the converted metrics, to put the unit in the Prometheus name. Otherwise, anyone referring to the Druid metrics doc would find the metric documented in a different unit.
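The tradeoff above can be sketched as a small helper. This is a hedged illustration only; `UnitConversion`, `toBaseUnit`, and `withUnitSuffix` are hypothetical names, not the extension's actual API. The assumption is that a `conversionFactor` of 1000.0 divides a raw millisecond value down to Prometheus base seconds, and that the exported name carries a unit suffix so the conversion stays visible:

```java
// Hypothetical sketch: converting a Druid metric value to a Prometheus
// base unit, and making the unit explicit in the exported metric name.
class UnitConversion
{
  // A conversionFactor of 1000.0 turns milliseconds into seconds.
  static double toBaseUnit(double rawValue, double conversionFactor)
  {
    return rawValue / conversionFactor;
  }

  // Suffix the Prometheus name with the base unit so it is self-documenting.
  static String withUnitSuffix(String sanitizedName, String unit)
  {
    return sanitizedName + "_" + unit;
  }
}
```

With this naming scheme, a reader of the Druid docs who sees `query/time` in milliseconds can tell from a `_seconds` suffix that the exported value was converted.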
"query/failed/count" : { "dimensions" : [], "type" : "count", "help": "Number of failed queries"},
"query/interrupted/count" : { "dimensions" : [], "type" : "count", "help": "Number of queries interrupted due to cancellation or timeout"},

"query/cache/delta/numEntries" : { "dimensions" : [], "type" : "count", "help": "Number of entries in cache"},
Potential bug: deltas can be negative, but a Prometheus counter accepts only non-negative increments.
Will change to gauge. This only happens when, since the last emission, more entries were evicted than added, right?
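A minimal sketch of the counter-versus-gauge distinction being discussed. `SimpleCounter` and `SimpleGauge` are toy stand-ins for illustration, not the real `io.prometheus.client` classes; they only model the semantics that a counter rejects negative increments while a gauge can move in either direction:

```java
// Toy stand-ins showing why a delta metric needs a gauge rather than a counter.
class DeltaMetrics
{
  static class SimpleCounter
  {
    private double value;

    void inc(double amt)
    {
      // Counters are monotonically increasing, so negative amounts are rejected.
      if (amt < 0) {
        throw new IllegalArgumentException("Amount to increment must be non-negative.");
      }
      value += amt;
    }

    double get()
    {
      return value;
    }
  }

  static class SimpleGauge
  {
    private double value;

    // Gauges may go up or down, so a negative delta is fine.
    void add(double delta)
    {
      value += delta;
    }

    double get()
    {
      return value;
    }
  }
}
```

A `query/cache/delta/numEntries` emission of, say, -3 (more evictions than insertions since the last emission) would throw with the counter but is handled naturally by the gauge.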
private final Metrics metrics;
private final PrometheusEmitterConfig config;
private final PrometheusEmitterConfig.Strategy strategy;
private final Pattern pattern = Pattern.compile("[^a-zA-Z0-9_][^a-zA-Z0-9_]*");
Reuse the pattern in PrometheusEmitterConfig
These two are not the same regex. The one in PrometheusEmitterConfig is for the namespace, which needs to start with an alphabetic character.
private static final Logger log = new Logger(Metrics.class);
private final Map<String, DimensionsAndCollector> map = new HashMap<>();
private final ObjectMapper mapper = new ObjectMapper();
private final Pattern pattern = Pattern.compile("[^a-zA-Z_:][^a-zA-Z0-9_:]*");
Reuse the pattern in PrometheusEmitterConfig
Used the one in PrometheusEmitter.java.
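As an illustration of what the pattern above does, a Druid metric name can be sanitized into a Prometheus-legal one by replacing each run of illegal characters with a single underscore. `MetricNameSanitizer` is a hypothetical helper name for this sketch, not a class from the PR:

```java
import java.util.regex.Pattern;

// Sketch: collapse every run of characters illegal in a Prometheus metric
// name into one underscore, using the pattern discussed above.
class MetricNameSanitizer
{
  private static final Pattern PATTERN = Pattern.compile("[^a-zA-Z_:][^a-zA-Z0-9_:]*");

  static String sanitize(String druidMetricName)
  {
    return PATTERN.matcher(druidMetricName).replaceAll("_");
  }
}
```

For example, the slashes in `query/cache/delta/numEntries` each become an underscore, yielding a valid Prometheus name.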
  }
}

public Map<String, DimensionsAndCollector> getMap()
Maybe we can rename the map to registeredMetrics and then rename this method to getRegisteredMetrics(); I feel like this would be easier to read.
  }).readValue(is);
}
catch (IOException e) {
  throw new ISE(e, "Failed to parse metric dimensions and types");
}

void emitMetric(ServiceMetricEvent metricEvent)
Map<String, DimensionsAndCollector> map = metrics.getMap();
try {
  for (DimensionsAndCollector collector : map.values()) {
    pushGateway.push(collector.getCollector(), config.getNamespace(), ImmutableMap.of(config.getNamespace(), identifier));
Potential NPE? If the configured strategy is not pushgateway, then this pushGateway wouldn't have been instantiated.
Also, should we use a more meaningful label name for the identifier instead of using config.getNamespace()?
Will add the null check. However, flush() for this emitter should only be called by close(), where the strategy check has already been done.
For the identifier label name, any suggestions? config.namespace will be set in the config files for each service, so for example a peon task could end up with peon=taskXXX as the groupingKey.
@Override
public void start()
We should schedule a task to push updates periodically when the strategy is set to pushgateway.
Added. Does every 5 minutes sound reasonable?
Sorry I missed this - I think the scheduled executor may not be necessary. Main reason we've added strategy pushgateway is for things that are potentially too short-lived to be scraped by prometheus (in druid that's really just peon tasks). Things that are living long enough to be pushing every 5 minutes are likely not "task" based, and may be better fit for normal scraping. I lean toward keeping things simple, and pushing once at close seems sufficient.
Given that the only metric pushed by the peon is "last pushed timestamp", I think it's valid to remove the scheduled task. Removed.
public DimensionsAndCollector getByName(String name, String service)
{
  if (map.containsKey(name)) {
return Optional.ofNullable(map.get(name)).orElse(map.get(service + "_" + name));
Changed the second part to getOrDefault() for simplification. I don't see the need to change this function's return type from DimensionsAndCollector to Optional<DimensionsAndCollector>.
You don't need to change the return type, but anyway this is a minor comment; feel free to pick whichever you prefer.
Is there a timeline for this feature?
For what it's worth, I've been running production clusters with this extension for monitoring for over a year. Things are stable, but there are a couple of open issues around pushgateway collection of peon task metrics that are the main reasons to delay merging.
Verification of metrics on a local mini Druid cluster is shown in the table below. Coordinator, historical, broker, router, middle-manager, and overlord were set up as Prometheus monitoring targets.
@Tiaaa awesome! I see peon metrics in pushgateway as well. One thing: the label we're using right now is the task ID. I think this is going to be too high cardinality for Prometheus.
Hmm, it looks like the commits have become messed up for this PR one way or another, and GitHub is showing a lot of unrelated commits. @Tiaaa any chance you can clean this up to show only the changes of this PR, to make it easier to review?
Any update on this? Looking forward to having it.
@clintropolis it looks like this is the last failing build step: https://travis-ci.com/github/apache/druid/jobs/483326128. It seems unrelated to the new emitter. Are we good to merge?
Sorry for the delay, I will have a look as soon as I'm able and see if we can get this merged 👍
clintropolis left a comment
A few minor comments, but overall lgtm 👍 I don't know prometheus very well, but I think the mappings of metrics looks reasonable.
Thanks for your patience and persistence!
TaskSlotCountStatsMonitor
bufferCapacity
bufferpoolName
bufferPoolName
Oops, missed one; it's causing CI to fail.
@clintropolis anything left we need to do before merge?
Will this be available in the next 0.21.0 release? Thank you
Oops no, sorry got distracted and hadn't got back to this yet.
Unfortunately we have already cut the branch for 0.21.0, after which we only merge bug fixes, so this will go out in the release after that. 0.21.0 has been a bit delayed, so it shouldn't be too much longer before we begin the next release as well.

Fixes #8621
Adds a new extension prometheus-emitter to expose Druid metrics for collection directly by a Prometheus server.
This re-opens the earlier PR, #8621.