Skip to content

KAFKA-10165: Remove Percentiles from e2e metrics#8882

Merged
vvcephei merged 2 commits intoapache:trunkfrom
vvcephei:kafka-10165-remove-percentile-metrics
Jun 17, 2020
Merged

KAFKA-10165: Remove Percentiles from e2e metrics#8882
vvcephei merged 2 commits intoapache:trunkfrom
vvcephei:kafka-10165-remove-percentile-metrics

Conversation

@vvcephei
Copy link
Copy Markdown
Contributor

  • Remove problematic Percentiles measurements until the implementation is fixed
  • Fix leaking e2e metrics when task is closed
  • Fix leaking metrics when tasks are recycled

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Copy Markdown
Contributor Author

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ableegoldman , WDYT of these fixes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This previously relied on a lookup of the actual current system time. I thought we decided to use the cached system time. Can you set me straight, @ableegoldman ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, hm, I thought we decided to push the stateful-node-level metrics to TRACE so we could get the actual time at each node without a (potential) performance hit. But with the INFO-level metrics it would be ok since we're only updating it twice per process.
But maybe I'm misremembering...I suppose ideally we could run some benchmarks for both cases and see if it really makes a difference...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but right now, this is an INFO level metric, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably need to get refactored when you do the second PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I'm just not 100% sure we all agreed it was alright to get the actual system time even for the task-level metrics ... so we should probably stick with the cached time for now

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standby tasks don't currently register any sensors, but I personally rather to be defensive and idempotently ensure we remove any sensors while closing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. But why do it both here and in closeDirty vs doing so in close(clean)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to inline close(boolean), but am resisting the urge... This is a compromise ;)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I thought the answer might be something like that...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We previously relied on the task manager to remove these sensors before calling close, but forgot to do it before recycling. In retrospect, it's better to do it within the same class that creates the sensors to begin with.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, we should clean up anything we created in the same class

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes the sensor leak by simply registering these as task-level sensors. Note the node name is still provided to scope the sensors themselves.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We erroneously ignored the provided recordingLevel and set them to debug. It didn't manifest because this method happens to always be called with a recordingLevel of debug anyway.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped the percentiles metric.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github won't let me comment on these lines, but we should remove the two percentiles-necessitated constants above (PERCENTILES_SIZE_IN_BYTES and MAXIMUM_E2E_LATENCY)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, missed those. Thanks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just cleaning up some oddball literals.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test wasn't really testing the "terminal node" code path in ProcessorContextImpl, just that this overload actually fetches the current system time. Since I removed the overload, we don't need the test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified this fails on trunk.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏

Copy link
Copy Markdown
Member

@ableegoldman ableegoldman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for picking this up!

@vvcephei
Copy link
Copy Markdown
Contributor Author

Rebased on trunk.

@vvcephei
Copy link
Copy Markdown
Contributor Author

All failures unrelated (were different in each build):

org.apache.kafka.streams.integration.OptimizedKTableIntegrationTest.shouldApplyUpdatesToStandbyStore
kafka.admin.ReassignPartitionsUnitTest.testModifyBrokerThrottles
org.apache.kafka.connect.mirror.MirrorConnectorsIntegrationTest.testReplication

@vvcephei vvcephei merged commit 147ffb9 into apache:trunk Jun 17, 2020
@vvcephei vvcephei deleted the kafka-10165-remove-percentile-metrics branch June 17, 2020 14:24
vvcephei added a commit that referenced this pull request Jun 17, 2020
* Remove problematic Percentiles measurements until the implementation is fixed
* Fix leaking e2e metrics when task is closed
* Fix leaking metrics when tasks are recycled

Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>
@vvcephei
Copy link
Copy Markdown
Contributor Author

cherry-picked to 2.6

Comment on lines +152 to +155
final Map<String, String> tagMap = streamsMetrics.nodeLevelTagMap(threadId, taskId, processorNodeId);
addMinAndMaxToSensor(
sensor,
PROCESSOR_NODE_LEVEL_GROUP,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvcephei I'm not familiar enough with the metrics classification to know if this will be an issue or just an oddity, but we now have allegedly task-level metrics but with the processor-node-level tags/grouping. It's kind of a "task metric in implementation, processor node metric in interface" -- might be confusing for us but should be alright for users, yeah?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We give it the task sensor prefix which becomes part of the full sensor name, rather than the processor node prefix

Kvicii pushed a commit to Kvicii/kafka that referenced this pull request Jun 21, 2020
* 'trunk' of github.com:apache/kafka:
  KAFKA-10168: fix StreamsConfig parameter name variable (apache#8865)
  MINOR: code cleanup for inconsistent naming (apache#8871)
  KAFKA-10138: Prefer --bootstrap-server for reassign_partitions command in ducktape tests (apache#8898)
  KAFKA-10185: Restoration info logging (apache#8896)
  KAFKA-9891: add integration tests for EOS and StandbyTask (apache#8890)
  MINOR: Reduce build time by gating test coverage plugins behind a flag (apache#8899)
  KAFKA-10141; Add more detail to log segment delete messages (apache#8850)
  KAFKA-10113; Specify fetch offsets correctly in `LogTruncationException` (apache#8822)
  KAFKA-10167: use the admin client to read end-offset (apache#8876)
  MINOR: Upgrade ducktape to 0.7.8 (apache#8879)
  KAFKA-10123; Fix incorrect value for AWAIT_RESET#hasPosition (apache#8841)
  KAFKA-9896: fix flaky StandbyTaskEOSIntegrationTest (apache#8883)
  MINOR: clean up unused checkstyle suppressions for Streams (apache#8861)
  MINOR: reuse toConfigObject(Map) to generate Config (apache#8889)
  MINOR: Upgrade jetty to 9.4.27.v20200227 and jersey to 2.31 (apache#8859)
  MINOR: Fix flaky HighAvailabilityTaskAssignorIntegrationTest (apache#8884)
  KAFKA-10147 MockAdminClient#describeConfigs(Collection<ConfigResource>) is unable to handle broker resource (apache#8853)
  KAFKA-10165: Remove Percentiles from e2e metrics (apache#8882)

# Conflicts:
#	core/src/main/scala/kafka/log/Log.scala
vvcephei added a commit to confluentinc/kafka that referenced this pull request Jun 26, 2020
* Remove problematic Percentiles measurements until the implementation is fixed
* Fix leaking e2e metrics when task is closed
* Fix leaking metrics when tasks are recycled

Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants