KAFKA-10165: Remove Percentiles from e2e metrics #8882
vvcephei merged 2 commits into apache:trunk from vvcephei:kafka-10165-remove-percentile-metrics
vvcephei
left a comment
Hey @ableegoldman, WDYT of these fixes?
This previously relied on a lookup of the actual current system time. I thought we decided to use the cached system time. Can you set me straight, @ableegoldman?
Oh, hm, I thought we decided to push the stateful-node-level metrics to TRACE so we could get the actual time at each node without a (potential) performance hit. But with the INFO-level metrics it would be ok since we're only updating it twice per process.
But maybe I'm misremembering...I suppose ideally we could run some benchmarks for both cases and see if it really makes a difference...
Ok, but right now, this is an INFO level metric, right?
This will probably need to get refactored when you do the second PR.
Yeah. I'm just not 100% sure we all agreed it was alright to get the actual system time even for the task-level metrics ... so we should probably stick with the cached time for now
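The cached-versus-actual time trade-off in this thread can be sketched roughly as follows. This is a hypothetical illustration, not the Kafka Streams implementation: `CachedClock`, `E2eLatencySketch`, and their methods are invented names, and the assumption is that the clock is refreshed once per processing pass while each record's latency is computed from the cached value.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch (not Kafka Streams code): a clock whose reading is
// cached so that per-record metric updates avoid a system time lookup.
final class CachedClock {
    private final LongSupplier wallClock; // e.g. System::currentTimeMillis
    private long cachedMs;
    private int lookups; // number of times the underlying clock was consulted

    CachedClock(final LongSupplier wallClock) {
        this.wallClock = wallClock;
        refresh();
    }

    // Consult the underlying clock once, e.g. at the start of a processing pass.
    void refresh() {
        cachedMs = wallClock.getAsLong();
        lookups++;
    }

    long cachedMs() {
        return cachedMs;
    }

    int lookups() {
        return lookups;
    }
}

// Hypothetical min/max end-to-end latency tracker (no percentiles).
final class E2eLatencySketch {
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Latency is "now" minus the record timestamp; "now" is supplied by the caller.
    void record(final long recordTimestampMs, final long nowMs) {
        final long latency = nowMs - recordTimestampMs;
        min = Math.min(min, latency);
        max = Math.max(max, latency);
    }

    long min() {
        return min;
    }

    long max() {
        return max;
    }
}

final class CachedTimeDemo {
    public static void main(final String[] args) {
        final CachedClock clock = new CachedClock(System::currentTimeMillis);
        final E2eLatencySketch latency = new E2eLatencySketch();
        final long base = clock.cachedMs();
        // Every record in this pass shares the same cached "now".
        latency.record(base - 100L, clock.cachedMs());
        latency.record(base - 10L, clock.cachedMs());
        System.out.println("min=" + latency.min() + " max=" + latency.max()
            + " clock lookups=" + clock.lookups());
    }
}
```

The cost of the cached approach is that a latency reading can be stale by up to the time since the last `refresh()`; the benefit is that the number of clock lookups does not grow with the number of records.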
Standby tasks don't currently register any sensors, but I'd personally rather be defensive and idempotently ensure we remove any sensors while closing.
Sounds good. But why do it both here and in closeDirty vs doing so in close(clean)?
I'd like to inline close(boolean), but am resisting the urge... This is a compromise ;)
Fair enough. I thought the answer might be something like that...
We previously relied on the task manager to remove these sensors before calling close, but forgot to do it before recycling. In retrospect, it's better to do it within the same class that creates the sensors to begin with.
Agreed, we should clean up anything we created in the same class
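A minimal sketch of the "clean up what you created, in the class that created it" idea from this thread. All names here (`SensorRegistrySketch`, `TaskSketch`, `closeClean`, `closeDirty`, `recycle`) are hypothetical stand-ins rather than the actual Kafka Streams classes; the point is only that every exit path of the owning class removes the sensors, and that removal is safe to repeat.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a metrics registry keyed by full sensor name.
final class SensorRegistrySketch {
    private final Set<String> sensors = new HashSet<>();

    void add(final String fullName) {
        sensors.add(fullName);
    }

    // Idempotent: removing sensors that are already gone is a no-op.
    void removeAllFor(final String taskId) {
        sensors.removeIf(name -> name.startsWith(taskId + "."));
    }

    int size() {
        return sensors.size();
    }
}

// Hypothetical task that registers its own sensors and therefore also removes them.
final class TaskSketch {
    private final String taskId;
    private final SensorRegistrySketch registry;

    TaskSketch(final String taskId, final SensorRegistrySketch registry) {
        this.taskId = taskId;
        this.registry = registry;
        registry.add(taskId + ".e2e-latency");
    }

    void closeClean() {
        removeSensors();
    }

    void closeDirty() {
        removeSensors();
    }

    // Recycling was the path that previously leaked, because an outside caller
    // was expected to remove the sensors first and forgot to.
    void recycle() {
        removeSensors();
    }

    private void removeSensors() {
        registry.removeAllFor(taskId);
    }
}
```

Because `removeSensors()` is idempotent, calling it from both `closeClean()` and `closeDirty()` (and again from any caller that still cleans up defensively) does no harm.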
Fixes the sensor leak by simply registering these as task-level sensors. Note the node name is still provided to scope the sensors themselves.
We erroneously ignored the provided recordingLevel and set them to debug. It didn't manifest because this method happens to always be called with a recordingLevel of debug anyway.
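The kind of bug described above, and why it stayed hidden, can be illustrated with a small hypothetical example. `LevelSketch`, `SensorSketch`, and `SensorFactorySketch` are invented for illustration and do not mirror the real method signatures in Kafka Streams.

```java
// Hypothetical recording levels, named after the ones mentioned in this thread.
enum LevelSketch { INFO, DEBUG, TRACE }

// Hypothetical sensor that only remembers its name and recording level.
final class SensorSketch {
    final String name;
    final LevelSketch level;

    SensorSketch(final String name, final LevelSketch level) {
        this.name = name;
        this.level = level;
    }
}

final class SensorFactorySketch {
    // Buggy variant: the recordingLevel argument is accepted but ignored.
    static SensorSketch buggy(final String name, final LevelSketch recordingLevel) {
        return new SensorSketch(name, LevelSketch.DEBUG);
    }

    // Fixed variant: the caller's recordingLevel is honored.
    static SensorSketch fixed(final String name, final LevelSketch recordingLevel) {
        return new SensorSketch(name, recordingLevel);
    }
}
```

As long as every caller passes `DEBUG`, the two variants behave identically, which is why the mistake never manifested; the difference only appears once a caller asks for another level such as `INFO`.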
Dropped the percentiles metric.
GitHub won't let me comment on these lines, but we should remove the two percentiles-necessitated constants above (PERCENTILES_SIZE_IN_BYTES and MAXIMUM_E2E_LATENCY).
Ah, missed those. Thanks!
Just cleaning up some oddball literals.
This test wasn't really testing the "terminal node" code path in ProcessorContextImpl, just that this overload actually fetches the current system time. Since I removed the overload, we don't need the test.
Verified this fails on trunk.
ableegoldman
left a comment
LGTM, thanks for picking this up!
Rebased on trunk.
All failures unrelated (were different in each build):
* Remove problematic Percentiles measurements until the implementation is fixed
* Fix leaking e2e metrics when task is closed
* Fix leaking metrics when tasks are recycled

Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>
cherry-picked to 2.6
    final Map<String, String> tagMap = streamsMetrics.nodeLevelTagMap(threadId, taskId, processorNodeId);
    addMinAndMaxToSensor(
        sensor,
        PROCESSOR_NODE_LEVEL_GROUP,
@vvcephei I'm not familiar enough with the metrics classification to know if this will be an issue or just an oddity, but we now have allegedly task-level metrics but with the processor-node-level tags/grouping. It's kind of a "task metric in implementation, processor node metric in interface" -- might be confusing for us but should be alright for users, yeah?
We give it the task sensor prefix which becomes part of the full sensor name, rather than the processor node prefix