KAFKA-10165: Remove Percentiles from e2e metrics by vvcephei · Pull Request #8882 · apache/kafka

vvcephei · 2020-06-16T20:14:37Z

Remove problematic Percentiles measurements until the implementation is fixed
Fix leaking e2e metrics when task is closed
Fix leaking metrics when tasks are recycled

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

vvcephei

Hey @ableegoldman, WDYT of these fixes?

vvcephei · 2020-06-16T20:15:44Z

This previously relied on a lookup of the actual current system time. I thought we decided to use the cached system time. Can you set me straight, @ableegoldman?

Oh, hm, I thought we decided to push the stateful-node-level metrics to TRACE so we could get the actual time at each node without a (potential) performance hit. But with the INFO-level metrics it would be ok since we're only updating it twice per process.
But maybe I'm misremembering...I suppose ideally we could run some benchmarks for both cases and see if it really makes a difference...

Ok, but right now, this is an INFO level metric, right?

This will probably need to get refactored when you do the second PR.

Yeah. I'm just not 100% sure we all agreed it was alright to get the actual system time even for the task-level metrics ... so we should probably stick with the cached time for now
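To make the trade-off in this thread concrete, here is a minimal sketch of the two ways to compute an end-to-end latency: against a time value cached once per processing loop, or against a fresh clock lookup per record. `E2eLatencyRecorder` and its methods are hypothetical stand-ins for illustration, not Kafka Streams classes.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for a task that records end-to-end latency.
// It is NOT part of the Kafka Streams API.
final class E2eLatencyRecorder {
    private final LongSupplier clock; // source of the actual wall-clock time
    private long cachedTimeMs;        // refreshed once per loop iteration
    private long lastLatencyMs = -1L;

    E2eLatencyRecorder(final LongSupplier clock) {
        this.clock = clock;
        this.cachedTimeMs = clock.getAsLong();
    }

    // Intended to be called once per processing loop, not once per record.
    void refreshCachedTime() {
        cachedTimeMs = clock.getAsLong();
    }

    // Cheap: no clock lookup on the per-record path, but the result can be stale.
    void recordWithCachedTime(final long recordTimestampMs) {
        lastLatencyMs = cachedTimeMs - recordTimestampMs;
    }

    // Accurate: one clock lookup per call, which is the potential performance hit.
    void recordWithActualTime(final long recordTimestampMs) {
        lastLatencyMs = clock.getAsLong() - recordTimestampMs;
    }

    long lastLatencyMs() {
        return lastLatencyMs;
    }
}
```

With the cached variant, the reported latency lags by however long it has been since the last refresh. That staleness is the cost being weighed against the per-record clock lookup.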

vvcephei · 2020-06-16T20:16:53Z

Standby tasks don't currently register any sensors, but I'd personally rather be defensive and idempotently ensure we remove any sensors while closing.

Sounds good. But why do it both here and in closeDirty vs doing so in close(clean)?

I'd like to inline close(boolean), but am resisting the urge... This is a compromise ;)

Fair enough. I thought the answer might be something like that...

vvcephei · 2020-06-16T20:18:43Z

We previously relied on the task manager to remove these sensors before calling close, but forgot to do it before recycling. In retrospect, it's better to do it within the same class that creates the sensors to begin with.

Agreed, we should clean up anything we created in the same class
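The ownership rule agreed on here, together with the idempotent removal discussed for standby tasks, can be sketched roughly as follows. `SensorRegistry` and `OwningTask` are hypothetical names used only for this illustration. The real task classes and metrics registry have different shapes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a metrics registry; not the Kafka Streams API.
final class SensorRegistry {
    private final Set<String> sensors = new HashSet<>();

    void add(final String name) {
        sensors.add(name);
    }

    // Removing an absent sensor is a no-op, which keeps cleanup idempotent.
    void remove(final String name) {
        sensors.remove(name);
    }

    int size() {
        return sensors.size();
    }
}

// Hypothetical task that removes exactly the sensors it registered,
// on every path that ends its lifetime (clean close, dirty close, recycle).
final class OwningTask {
    private final SensorRegistry registry;
    private final Set<String> owned = new HashSet<>();

    OwningTask(final SensorRegistry registry, final String taskId) {
        this.registry = registry;
        register("task." + taskId + ".e2e-latency"); // assumed naming scheme
    }

    private void register(final String name) {
        registry.add(name);
        owned.add(name);
    }

    private void removeOwnedSensors() {
        for (final String name : owned) {
            registry.remove(name);
        }
        owned.clear();
    }

    void closeClean() {
        removeOwnedSensors();
    }

    void closeDirty() {
        removeOwnedSensors();
    }

    void closeAndRecycle() {
        removeOwnedSensors();
    }
}
```

Because the cleanup lives next to the registration, a caller such as a task manager cannot forget it on a newly added path like recycling. Calling it twice is harmless.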

vvcephei · 2020-06-16T20:19:35Z

Fixes the sensor leak by simply registering these as task-level sensors. Note the node name is still provided to scope the sensors themselves.
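As a rough illustration of why registering under the task scope fixes the leak: if sensors are tracked per task, one task-level removal sweeps up every sensor registered under that task, including those whose names carry a node name. The `TaskScopedSensors` class and the `task.<id>.s.<node>-<metric>` name format below are assumptions made for this sketch, not the actual Streams naming scheme.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical per-task sensor bookkeeping; names and format are illustrative only.
final class TaskScopedSensors {
    private final Set<String> registered = new HashSet<>();
    private final Map<String, List<String>> namesByTaskPrefix = new HashMap<>();

    private static String taskPrefix(final String taskId) {
        return "task." + taskId; // assumed prefix format
    }

    // Registers a sensor under the task's prefix; the node name only
    // disambiguates sensors within that task.
    String addNodeScopedTaskSensor(final String taskId, final String nodeName, final String metricName) {
        final String prefix = taskPrefix(taskId);
        final String fullName = prefix + ".s." + nodeName + "-" + metricName;
        if (registered.add(fullName)) {
            namesByTaskPrefix.computeIfAbsent(prefix, k -> new ArrayList<>()).add(fullName);
        }
        return fullName;
    }

    // A single task-level cleanup removes every sensor under the task prefix.
    void removeAllTaskSensors(final String taskId) {
        final List<String> names = namesByTaskPrefix.remove(taskPrefix(taskId));
        if (names != null) {
            registered.removeAll(names);
        }
    }

    boolean isRegistered(final String fullName) {
        return registered.contains(fullName);
    }

    int count() {
        return registered.size();
    }
}
```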

vvcephei · 2020-06-16T20:21:10Z

We erroneously ignored the provided recordingLevel and set them to debug. It didn't manifest because this method happens to always be called with a recordingLevel of debug anyway.
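The shape of this bug, and why it stayed hidden, can be shown with a small sketch. `DemoLevel`, `DemoSensor`, and `DemoSensorFactory` are hypothetical stand-ins, not the real sensor classes.

```java
// Hypothetical stand-ins for illustration; not the Kafka metrics classes.
enum DemoLevel { INFO, DEBUG, TRACE }

final class DemoSensor {
    final String name;
    final DemoLevel level;

    DemoSensor(final String name, final DemoLevel level) {
        this.name = name;
        this.level = level;
    }
}

final class DemoSensorFactory {
    // Buggy shape: the recordingLevel parameter is accepted but never used.
    static DemoSensor createIgnoringLevel(final String name, final DemoLevel recordingLevel) {
        return new DemoSensor(name, DemoLevel.DEBUG);
    }

    // Fixed shape: the caller's recordingLevel is passed through.
    static DemoSensor createWithLevel(final String name, final DemoLevel recordingLevel) {
        return new DemoSensor(name, recordingLevel);
    }
}
```

As long as every caller passes the debug level, the two variants return identical sensors, so no test or user would notice. The difference only appears once some caller asks for a different level.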

vvcephei · 2020-06-16T20:21:43Z

Dropped the percentiles metric.

GitHub won't let me comment on these lines, but we should remove the two percentiles-necessitated constants above (PERCENTILES_SIZE_IN_BYTES and MAXIMUM_E2E_LATENCY)

Ah, missed those. Thanks!
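For context on why those two constants become dead code: a min/max measurement needs only constant state and no configured bound. Judging by the constant names alone, the percentiles measurement needed a byte budget and a maximum value. The `MinMaxLatency` class below is a hypothetical sketch of the min/max idea, not the sensor implementation used in the PR.

```java
// Hypothetical running min/max tracker; not the Kafka metrics implementation.
final class MinMaxLatency {
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    private long count = 0L;

    // Constant-time, constant-space update: no buckets and no upper bound to configure.
    void record(final long latencyMs) {
        min = Math.min(min, latencyMs);
        max = Math.max(max, latencyMs);
        count++;
    }

    boolean isEmpty() {
        return count == 0L;
    }

    long min() {
        if (isEmpty()) {
            throw new IllegalStateException("no latencies recorded");
        }
        return min;
    }

    long max() {
        if (isEmpty()) {
            throw new IllegalStateException("no latencies recorded");
        }
        return max;
    }
}
```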

vvcephei · 2020-06-16T20:22:52Z

Just cleaning up some oddball literals.

vvcephei · 2020-06-16T20:24:06Z

This test wasn't really testing the "terminal node" code path in ProcessorContextImpl, just that this overload actually fetches the current system time. Since I removed the overload, we don't need the test.

vvcephei · 2020-06-16T20:24:47Z

Verified this fails on trunk.

ableegoldman · 2020-06-16T22:22:11Z

ableegoldman

LGTM, thanks for picking this up!

vvcephei · 2020-06-17T03:35:02Z

Rebased on trunk.

vvcephei · 2020-06-17T14:22:51Z

All failures unrelated (were different in each build):

org.apache.kafka.streams.integration.OptimizedKTableIntegrationTest.shouldApplyUpdatesToStandbyStore
kafka.admin.ReassignPartitionsUnitTest.testModifyBrokerThrottles
org.apache.kafka.connect.mirror.MirrorConnectorsIntegrationTest.testReplication

* Remove problematic Percentiles measurements until the implementation is fixed
* Fix leaking e2e metrics when task is closed
* Fix leaking metrics when tasks are recycled

Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>

vvcephei · 2020-06-17T14:36:58Z

cherry-picked to 2.6

ableegoldman · 2020-06-17T19:25:25Z

+        final Map<String, String> tagMap = streamsMetrics.nodeLevelTagMap(threadId, taskId, processorNodeId);
+        addMinAndMaxToSensor(
+            sensor,
+            PROCESSOR_NODE_LEVEL_GROUP,


@vvcephei I'm not familiar enough with the metrics classification to know if this will be an issue or just an oddity, but we now have allegedly task-level metrics but with the processor-node-level tags/grouping. It's kind of a "task metric in implementation, processor node metric in interface" -- might be confusing for us but should be alright for users, yeah?

We give it the task sensor prefix which becomes part of the full sensor name, rather than the processor node prefix



Comment thread streams/src/test/java/org/apache/kafka/streams/processor/internals/StandbyTaskTest.java Outdated

ableegoldman Jun 16, 2020

🙏

ableegoldman approved these changes Jun 16, 2020

John Roesler added 2 commits June 16, 2020 22:35

KAFKA-10165: Remove Percentiles from e2e metrics

7c3c51f

separate test case for closeAndRecycle metrics

ae1a390

vvcephei merged commit 147ffb9 into apache:trunk Jun 17, 2020

vvcephei deleted the kafka-10165-remove-percentile-metrics branch June 17, 2020 14:24
