Skip to content

KAFKA-10199: Add task updater metrics, part 2#13300

Merged
mjsax merged 7 commits intoapache:trunkfrom
guozhangwang:K10199-task-metrics-p2
Apr 5, 2023
Merged

KAFKA-10199: Add task updater metrics, part 2#13300
mjsax merged 7 commits intoapache:trunkfrom
guozhangwang:K10199-task-metrics-p2

Conversation

@guozhangwang
Copy link
Copy Markdown
Contributor

@guozhangwang guozhangwang commented Feb 24, 2023

  1. Added task-level metrics as part of KIP-869:
  • restore-rate, restore-total (for active)
  • update-rate, update-total (for standby)
  • restore-remaining-records-total (for active)
  1. Fixed some naming confusions in the XXXMetrics classes, especially for distinguishing sensor name constructs and metric name constructs; also fixed a recording bug on dropped record sensor.

  2. Also cleaned up a couple unused metrics and util functions in the metrics classes.

  3. Add related unit tests.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Copy Markdown
Contributor Author

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucasbru @cadonna would appreciate your reviews while I add unit tests in parallel.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for measuring remaining records is a bit complex: we first aggregate the total amount of records to restore across all changelog partitions at the beginning when initializing the changelogs; and then during restoration we keep decrementing by the number of restored records.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be helpful to rename this method to initRestoreRemaining to make it's purpuse clear (and/or add a JavaDoc to the method in StreamTask)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not used in prod code, hence cleaning it up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a piggy-backed metric fix: we should use cumulativeSum than cumulativeCount for dropped records, even though today with most callees as sensor.record() it is effectively the same as it only increment by 1, it is still vulnerable in case we record a non-one value in the future.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric is removed as part of KIP-743, and it's only used in tests (which I also cleaned up as a piggy-back).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we remove this null check?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pondered on the code and I think it should not be null ever? Please correct me if I'm wrong.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjsax LMK what do you think? I may lack some background here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was just wondering. It seemed to be unrelated to this PR, and so I just assumed there must be a reason for having the null check.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be helpful to rename this method to initRestoreRemaining to make it's purpuse clear (and/or add a JavaDoc to the method in StreamTask)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to apply max() ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very compelling reasons, I just want to make sure we do not start with a negative number, but I cannot think of a case that it could be negative.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment to cover the very edge case when it could become negative.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restore or restored ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string is not only used as part of the metric name, but also used as part of the sensor name (suffix to be more specific). In our KIP proposal it's defined as, e.g. restore-rate | total so it's correct to be defined as restore/update.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update or updated?

@mjsax mjsax added streams kip Requires or implements a KIP labels Mar 3, 2023
@guozhangwang guozhangwang force-pushed the K10199-task-metrics-p2 branch from 494c079 to 9b8531b Compare April 5, 2023 18:27
@mjsax mjsax merged commit 653baa6 into apache:trunk Apr 5, 2023
@showuon showuon mentioned this pull request Apr 6, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants