Add more metric reporting to MSQ#18121
Conversation
gianm
left a comment
There was a problem hiding this comment.
Just a few line comments. Please also include tests that verify the controller and workers are emitting the correct metrics with the correct dimensions. One way you can do that is:
- update
MSQTestControllerContextandMSQTestWorkerContextto haveemitMetricwrite emitted metrics into some list - provide a mechanism for test cases to get the emitted metrics
| context.emitMetric( | ||
| "query/time", | ||
| MSQMetricUtils.createQueryMetricDimensions(datasources, intervals, success), | ||
| startTime - System.currentTimeMillis() |
There was a problem hiding this comment.
I still feel that the context#emitMetric method is odd; it could be something like:
MsqServiceMetricEventBuilder metricBuilder = new MsqServiceMetricEventBuilder(querySpec);
metricBuilder.setDimension(DruidMetrics.DATASOURCE, DefaultQueryMetrics.getTableNamesAsString(datasources));
metricBuilder.setDimension(DruidMetrics.INTERVAL, DefaultQueryMetrics.getIntervalsAsStringArray(intervals));
metricBuilder.setDimension(DruidMetrics.DURATION, BaseQuery.calculateDuration(intervals));
metricBuilder.setDimension(DruidMetrics.SUCCESS, success);
metricBuilder.setMetric("query/time", startTime - System.currentTimeMillis());
context.getEmitter().emit(metricBuilder);
probably MsqServiceMetricEventBuilder could also provide a more natural home for those static methods which could become instance methods
..or alter the signature of context#emitrMetric to context#emit(metricBuilder) ? so that the context could still
add stuff if it wants - but the caller has the control.
There was a problem hiding this comment.
sorry about the above author :D it was me...I was just checking some github features ; which involved switching between 2 users ....
There was a problem hiding this comment.
Ah, yes, that would work too. It's kind of equivalent (the metricBuilder has a metric name, value, and dimensions in it) but bundled together more nicely. @adarshsanjeev - could you try this approach? It looks like it could be cleaner.
There was a problem hiding this comment.
@kgyrtkirk Made these changes. The function now takes a service emitter.
There's an awkward bit in TaskControllerContext, where we have to check if the dataSource has already been provided in the metricBuilder before setting the default value, for backward compatibility. I wonder if it is worth adding yet another function of getDefaultMetricBuilder() from the context instead, so that the caller can decide what all to override instead of this weird code.
Please let me know your thoughts.
kgyrtkirk
left a comment
There was a problem hiding this comment.
thank you for the updates! looks good
left a few notes here and there - the IT failures are unrelated
| Set<String> datasourceDim = (Set<String>) queryMetricDimensions.computeIfAbsent(DruidMetrics.DATASOURCE, s -> new HashSet<>()); | ||
| Set<Interval> intervalDim = (Set<Interval>) queryMetricDimensions.computeIfAbsent(DruidMetrics.INTERVAL, s -> new HashSet<>()); | ||
| for (StageDefinition stageDef : queryDef.getStageDefinitions()) { | ||
| datasourceDim.addAll(MSQMetricUtils.getDatasources(stageDef)); |
There was a problem hiding this comment.
nit: is this a one time need - or maybe there should be queryDef.getDatasources ?
There was a problem hiding this comment.
Currently, I believe this is a one time need. Now that I have changed getDatasources to a member of stageDefinition, maybe this isn't required.
* concurrency issue with StubServiceEmitter - #18121 have added a few new metrics which have increased the load on it and have caused random appearances of an issue arising from the fact it used an `ArrayList` under the hood. * added some catches to shut down queries properly in case some unexpected exceptions occur - this could give better exceptions and reduce time to fix in the future this should reduce the probability that `QTest` splits remain hanging
* concurrency issue with StubServiceEmitter - apache#18121 have added a few new metrics which have increased the load on it and have caused random appearances of an issue arising from the fact it used an `ArrayList` under the hood. * added some catches to shut down queries properly in case some unexpected exceptions occur - this could give better exceptions and reduce time to fix in the future this should reduce the probability that `QTest` splits remain hanging
* concurrency issue with StubServiceEmitter - #18121 have added a few new metrics which have increased the load on it and have caused random appearances of an issue arising from the fact it used an `ArrayList` under the hood. * added some catches to shut down queries properly in case some unexpected exceptions occur - this could give better exceptions and reduce time to fix in the future this should reduce the probability that `QTest` splits remain hanging Co-authored-by: Zoltan Haindrich <kirk@rxd.hu>
Adds reporting of additional metrics and dimensions to MSQ task and Dart.
Summary
Currently, MSQ reports the a few metrics. (
ingest/tombstones/count,ingest/segments/countandingest/input/bytes).This PR adds a few new metrics:
query/time: Reported by controller and worker at the end of the query.query/cpu/time: Reported by each worker at the end of the query.Additionally, adds some more dimensions to the metrics emitted. Dimensions added to metrics:
queryIdsqlQueryIdengine: Denotes the engine used for the query,msq-dartormsq-task.dartQueryId: (Dart only)type: AlwaysmsqdataSourceintervaldurationsuccessThis PR has: