KAFKA-16446: Improve controller event duration logging (#15622)
cmccabe merged 15 commits into apache:trunk.
Conversation
soarez left a comment:
Did you consider tagging controllerMetrics.updateEventQueueProcessingTime with the event name instead? That would seem like a more general and useful solution.
@soarez wdym by tagging here? Note that the events have relatively unique names (we include offset in some) so the cardinality is quite high.

I see. What I meant would result in too many more metrics. Makes sense then.
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
Force-pushed from fea5f7b to c20ec56.
@soarez I've brushed off this old PR. PTAL
```diff
- * The maximum records that the controller will write in a single batch.
+ * The default maximum records that the controller will write in a single batch.
   */
-private static final int MAX_RECORDS_PER_BATCH = 10000;
+private static final int DEFAULT_MAX_RECORDS_PER_BATCH = 10000;
```
Even though there are no code paths that change the value, within the scope of QuorumController.Builder this is in fact a default.
I'm not sure I understand your comment. Are you just agreeing with the change to DEFAULT_ or something else?
@soarez thanks for the review. Addressed your comments.
I was a bit confused by this PR since I thought that it was going to be about setting a threshold (like 5 seconds or something) and saying that any event that lasted longer than that was so bad, so egregious, that it should always be logged. (Tangent: I still actually do think we should do that!) But instead this is more like having a background process gathering stats. Which is actually quite useful in its own way.

Question, though: why can't we set the logging interval to 60 seconds and just log the longest event unconditionally? It's not that much noise, and people can turn it off if they want to (by setting the log4j level for this class).

There's probably a bunch of other stuff we could do, like try to get the "top 5" rather than just the top event. But we can always do that in a follow-on, no need to block this one.

I think we should rename
|
@cmccabe Thanks for taking a look!
I thought about this for a while and couldn’t come up with a good threshold. Looking at our CCloud data some clusters run at 10ms average event times so an event of 200ms would be interesting to observe. Other clusters, we are seeing average of 100ms event times, so 200ms isn’t so interesting. That’s what led me to taking a statistical approach. However, we could definitely add an “always log above this threshold” as a separate thing (with a unique log line).
We could, though that could make finding some rare event a bit more difficult. Also, if we had a burst of slow events, we would only log one instead of all that were above p99 (rare, but possible due to the histogram behavior).
Seems fine to me. Like you said, we could evolve this to capture more stuff in the future.
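The per-period bookkeeping discussed above (count, running total for the average, and the single slowest event) can be sketched roughly as follows. This is illustrative only: the class and method names are mine, and the merged EventPerformanceMonitor differs in detail.

```java
import java.util.Locale;

// Illustrative sketch of per-period event statistics, in the spirit of the
// PR's periodic log line. Not the merged Kafka code.
public class PeriodStats {
    private int eventCount = 0;
    private long totalEventDurationNs = 0;
    private long slowestEventDurationNs = 0;
    private String slowestEventName = null;

    /** Record one completed event, tracking the slowest one seen this period. */
    public void observe(String name, long durationNs) {
        eventCount++;
        totalEventDurationNs += durationNs;
        if (durationNs > slowestEventDurationNs) {
            slowestEventDurationNs = durationNs;
            slowestEventName = name;
        }
    }

    /** Build a summary like the PR's periodic message, then reset for the next period. */
    public String periodicSummary(long periodMs) {
        double avgMs = eventCount == 0 ? 0.0
            : totalEventDurationNs / 1_000_000.0 / eventCount;
        String summary = String.format(Locale.ROOT,
            "In the last %d ms period, %d events were completed, which took an "
            + "average of %.2f ms each. The slowest event was %s, which took %.2f ms.",
            periodMs, eventCount, avgMs, slowestEventName,
            slowestEventDurationNs / 1_000_000.0);
        eventCount = 0;
        totalEventDurationNs = 0;
        slowestEventDurationNs = 0;
        slowestEventName = null;
        return summary;
    }

    public static void main(String[] args) {
        PeriodStats stats = new PeriodStats();
        stats.observe("createTopics", 41_900_000L);               // 41.9 ms
        stats.observe("handleCommit[baseOffset=0]", 12_000_000L); // 12.0 ms
        System.out.println(stats.periodicSummary(60_000L));
    }
}
```

Note that, as discussed in the thread, tracking only a single maximum means a burst of slow events collapses into one log line; a "top 5" would need a small bounded heap instead of one field.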
```java
/**
 * The total duration of all the events we've seen.
 */
private long totalEventDurationNs;
```
Recording this is interesting. Since we have a single controller thread and can see its total event duration over a fixed period, we could calculate the idle percentage of the controller. Might be interesting to look at as a high level "busy" metric
Yeah, that might be interesting. It would probably be better to get idle numbers directly from the queue somehow (since it knows how long it waited...) just to avoid errors adding up over time.
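Since there is a single controller thread, the "busy" metric floated above reduces to dividing the summed event durations by the reporting period. A minimal sketch, with illustrative names not taken from the PR:

```java
// Hypothetical sketch: derive a coarse "busy" percentage for a
// single-threaded controller from the summed event durations.
public class BusyEstimator {
    private long totalEventDurationNs = 0;

    public void recordEvent(long durationNs) {
        totalEventDurationNs += durationNs;
    }

    /** Percentage of the reporting period spent processing events. */
    public double busyPercent(long periodNs) {
        if (periodNs <= 0) return 0.0;
        return 100.0 * totalEventDurationNs / periodNs;
    }

    public static void main(String[] args) {
        BusyEstimator e = new BusyEstimator();
        e.recordEvent(300_000_000L); // 300 ms
        e.recordEvent(200_000_000L); // 200 ms
        System.out.println(e.busyPercent(1_000_000_000L)); // busy % over a 1 s period
    }
}
```

As the reply notes, deriving idle time from the queue itself (which knows how long it waited) would avoid this estimate drifting as measurement errors accumulate.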
```java
/**
 * The always-log threshold in nanoseconds.
 */
private long alwaysLogThresholdNs;
```
One reason I didn't go with a fixed threshold originally was that I was concerned about a negative feedback loop with a congested controller. For example, if all the events on the controller are above this threshold due to some bug or external influence, doing the extra logging just makes things worse.
However, thinking about it more, maybe the extra logging wouldn't be such a hit. If the controller is congested and the threshold is a few seconds, then the extra few ms to do logging wouldn't change the situation much.
Right. Once you're at 2 seconds per event, another log message isn't going to change much. Also, I hope we're logging in less than a few ms, although with these logging libraries, ya never know...
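For concreteness, the guard being discussed amounts to a single comparison against the threshold; a sketch assuming the 2-second default mentioned in the PR description (names are illustrative):

```java
// Illustrative sketch of the always-log threshold check, not the merged code.
public class SlowEventCheck {
    // Assumed default from the PR description: events over 2 s log at ERROR.
    static final long ALWAYS_LOG_THRESHOLD_NS = 2_000_000_000L;

    static boolean shouldAlwaysLog(long durationNs) {
        return durationNs >= ALWAYS_LOG_THRESHOLD_NS;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlwaysLog(5_240_000_000L)); // a 5240 ms event
        System.out.println(shouldAlwaysLog(12_000_000L));    // a 12 ms event
    }
}
```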
```java
/**
 * @param durationNs The duration in nanoseconds.
 * @return The decimal duration in milliseconds.
 */
static String nanosecondsToDecimalMillis(long durationNs) {
```
naming: maybe "formatNanosAsMillis" or something with "format" in the name?
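Whatever the final name, the helper presumably boils down to fixed-point formatting of a nanosecond count as milliseconds. A minimal sketch using the reviewer's suggested name (not necessarily the merged signature or behavior):

```java
import java.util.Locale;

public class DurationFormat {
    /**
     * Sketch of the helper under discussion: format a nanosecond duration
     * as decimal milliseconds with two decimal places.
     */
    static String formatNanosAsMillis(long durationNs) {
        // Locale.ROOT avoids comma decimal separators in some locales.
        return String.format(Locale.ROOT, "%.2f", durationNs / 1_000_000.0);
    }

    public static void main(String[] args) {
        System.out.println(formatNanosAsMillis(41_900_000L) + " ms"); // 41.90 ms
    }
}
```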
There are times when the controller has a high event processing time, such as during startup, or when creating a topic with many partitions. We can see these processing times in the p99 metric (kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs), but it's difficult to see exactly which event is causing the high processing time. With DEBUG logs, we see every event along with its processing time, but even then it's tedious to find the slow event.

This PR logs all events which take longer than 2 seconds at ERROR level. This will help identify events that are taking far too long, and which could be disruptive to the operation of the controller. The slow event logging looks like this:

```
[2024-12-20 15:03:39,754] ERROR [QuorumController id=1] Exceptionally slow controller event createTopics took 5240 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
```

Also, every 60 seconds, it logs some event time statistics, including the average time, the maximum time, and the name of the event which took the longest. This periodic message looks like this:

```
[2024-12-20 15:35:04,798] INFO [QuorumController id=1] In the last 60000 ms period, 333 events were completed, which took an average of 12.34 ms each. The slowest event was handleCommit[baseOffset=0], which took 41.90 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
```

An operator can disable these logs by adding the following to their log4j config:

```
org.apache.kafka.controller.EventPerformanceMonitor=OFF
```

Reviewers: Colin P. McCabe <cmccabe@apache.org>