
KAFKA-16446: Improve controller event duration logging#15622

Merged
cmccabe merged 15 commits into apache:trunk from mumrah:KAFKA-16446-log-slow-events
Jan 6, 2025

Conversation

@mumrah
Member

@mumrah mumrah commented Mar 28, 2024

There are times when the controller has a high event processing time such as during startup or when creating a topic with many partitions. We can see these processing times in the p99 metric (kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs), however it's difficult to see exactly which event is causing high processing time.

With DEBUG logs, we see every event along with its processing time. Even with this, it's a bit tedious to find the event with a high processing time.

This PR logs all events which take longer than 2 seconds at ERROR level. This will help identify events that are taking far too long, and which could be disruptive to the operation of the controller. The slow event logging looks like this:

```
[2024-12-20 15:03:39,754] ERROR [QuorumController id=1] Exceptionally slow controller event createTopics took 5240 ms.  (org.apache.kafka.controller.EventPerformanceMonitor)
```

Also, every 60 seconds, it logs some event time statistics, including average time, maximum time, and the name of the event which took the longest. This periodic message looks like this:

```
[2024-12-20 15:35:04,798] INFO [QuorumController id=1] In the last 60000 ms period, 333 events were completed, which took an average of 12.34 ms each. The slowest event was handleCommit[baseOffset=0], which took 41.90 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
```

An operator can disable these logs by adding the following to their log4j config:

```
org.apache.kafka.controller.EventPerformanceMonitor=OFF
```
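For illustration, the 2-second always-log check described above can be sketched roughly as follows. The class and method names here are hypothetical, not the actual EventPerformanceMonitor API:

```java
import java.util.concurrent.TimeUnit;

public class SlowEventCheck {
    // The 2-second always-log threshold from the PR description.
    static final long SLOW_EVENT_THRESHOLD_NS = TimeUnit.SECONDS.toNanos(2);

    /** Returns true if the event duration crosses the always-log threshold. */
    static boolean isExceptionallySlow(long durationNs) {
        return durationNs >= SLOW_EVENT_THRESHOLD_NS;
    }

    public static void main(String[] args) {
        // An event matching the example log line above: 5240 ms.
        long durationNs = TimeUnit.MILLISECONDS.toNanos(5240);
        if (isExceptionallySlow(durationNs)) {
            System.out.printf("Exceptionally slow controller event %s took %d ms.%n",
                "createTopics", TimeUnit.NANOSECONDS.toMillis(durationNs));
        }
    }
}
```

Events at or above the threshold are logged unconditionally; everything below it only shows up in the periodic statistics.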

Member

@soarez soarez left a comment


Did you consider tagging controllerMetrics.updateEventQueueProcessingTime with the event name instead? That would seem like a more general and useful solution.

Comment thread metadata/src/main/java/org/apache/kafka/controller/metrics/SlowEventsLogger.java Outdated
@mumrah
Member Author

mumrah commented Mar 28, 2024

@soarez wdym by tagging here?

Note that the events have relatively unique names (we include offset in some) so the cardinality is quite high.

@soarez
Member

soarez commented Mar 29, 2024

events have relatively unique names (we include offset in some) so the cardinality is quite high

I see. What I meant would result in too many more metrics. Makes sense then.

@github-actions

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions Bot added the stale Stale PRs label Aug 20, 2024
@mumrah mumrah force-pushed the KAFKA-16446-log-slow-events branch from fea5f7b to c20ec56 on December 12, 2024 15:24
@github-actions github-actions Bot added core Kafka Broker kraft labels Dec 12, 2024
@mumrah mumrah marked this pull request as ready for review December 12, 2024 15:27
@mumrah
Member Author

mumrah commented Dec 12, 2024

@soarez i've brushed off this old PR. PTAL

@mumrah mumrah removed the stale Stale PRs label Dec 12, 2024
@mumrah mumrah requested review from cmccabe and soarez December 12, 2024 17:41
@mumrah mumrah changed the title from "Log slow controller events" to "KAFKA-16446 Log slow controller events" Dec 12, 2024
Member

@soarez soarez left a comment


@mumrah thanks for updating this

Comment thread server/src/main/java/org/apache/kafka/server/config/KRaftConfigs.java Outdated
Comment on lines -178 to +180:

```diff
- * The maximum records that the controller will write in a single batch.
+ * The default maximum records that the controller will write in a single batch.
   */
-private static final int MAX_RECORDS_PER_BATCH = 10000;
+private static final int DEFAULT_MAX_RECORDS_PER_BATCH = 10000;
```
Member


Even though there are no code paths that change the value, within the scope of QuorumController.Builder this is in fact a default.

Member Author


I'm not sure I understand your comment. Are you just agreeing with the change to DEFAULT_ or something else?

Comment thread metadata/src/main/java/org/apache/kafka/controller/QuorumController.java Outdated
Comment thread metadata/src/main/java/org/apache/kafka/controller/SlowEventsLogger.java Outdated
Comment thread metadata/src/main/java/org/apache/kafka/controller/SlowEventsLogger.java Outdated
Comment thread metadata/src/main/java/org/apache/kafka/controller/SlowEventsLogger.java Outdated
@mumrah
Member Author

mumrah commented Dec 13, 2024

@soarez thanks for the review. Addressed your comments

Member

@soarez soarez left a comment


Failed CI tests seem unrelated, and I wasn't able to reproduce them locally, so LGTM.

Thanks @mumrah!

@cmccabe
Contributor

cmccabe commented Dec 13, 2024

I was a bit confused by this PR since I thought that it was going to be about setting a threshold (like 5 seconds or something) and saying that any event that lasted longer than that was so bad, so egregious, that it should always be logged. (Tangent: I still actually do think we should do that!)

But instead this is more like having a background process gathering stats. Which is actually quite useful in its own way. Question, though: why can't we set the logging interval to 60 seconds and just log the longest event unconditionally? It's not that much noise and people can turn it off if they want to. (by setting the log4j for this class)

There's probably a bunch of other stuff we could do like try to get the "top 5" rather than just the top event. But we can always do that in a follow-on, no need to block this one.

I think we should rename SlowEventsLogger since it's more about general performance stats and not just about slow events. Also calling it "logger" sort of makes it sound like part of log4j. Perhaps call it EventPerformanceMonitor?

@mumrah
Member Author

mumrah commented Dec 14, 2024

@cmccabe Thanks for taking a look!

any event that lasted longer than that was so bad, so egregious, that it should always be logged

I thought about this for a while and couldn't come up with a good threshold. Looking at our CCloud data, some clusters run at 10 ms average event times, so an event of 200 ms would be interesting to observe. In other clusters we are seeing average event times of 100 ms, so 200 ms isn't so interesting. That's what led me to taking a statistical approach.

However, we could definitely add an “always log above this threshold” as a separate thing (with a unique log line).

Question, though: why can’t we set the logging interval to 60 seconds and just log the longest event unconditionally?

We could, though that could make finding some rare event a bit more difficult. Also, if we had a burst of slow events, we would only log one instead of all that were above p99 (rare, but possible due to the histogram behavior).

Perhaps call it EventPerformanceMonitor?

Seems fine to me. Like you said, we could evolve this to capture more stuff in the future.
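As a rough illustration of the periodic statistics discussed here (event count, average duration, and the slowest event per reporting interval), something along these lines would produce the 60-second summary shown in the PR description. This is a hypothetical sketch, not the actual Kafka implementation:

```java
import java.util.Locale;

// Accumulates per-period event stats and emits one summary line per interval.
class PeriodStats {
    private long count;
    private long totalNs;
    private long maxNs;
    private String slowestEvent;

    /** Record one completed event and its duration in nanoseconds. */
    void observe(String name, long durationNs) {
        count++;
        totalNs += durationNs;
        if (durationNs > maxNs) {
            maxNs = durationNs;
            slowestEvent = name;
        }
    }

    /** Build the periodic summary line, then reset for the next interval. */
    String summarizeAndReset(long periodMs) {
        double avgMs = count == 0 ? 0.0 : totalNs / 1_000_000.0 / count;
        String summary = String.format(Locale.ROOT,
            "In the last %d ms period, %d events were completed, which took an average of %.2f ms each. "
                + "The slowest event was %s, which took %.2f ms.",
            periodMs, count, avgMs, slowestEvent, maxNs / 1_000_000.0);
        count = 0;
        totalNs = 0;
        maxNs = 0;
        slowestEvent = null;
        return summary;
    }
}
```

Tracking only the single maximum keeps the overhead on the controller thread to a couple of field updates per event, which is the trade-off discussed above versus logging every slow event individually.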

Comment thread metadata/src/main/java/org/apache/kafka/controller/EventPerformanceMonitor.java Outdated
Comment thread server/src/main/java/org/apache/kafka/server/config/KRaftConfigs.java Outdated
@cmccabe cmccabe changed the title from "KAFKA-16446 Log slow controller events" to "KAFKA-16446: Improve controller event duration logging" Dec 20, 2024
```java
/**
 * The total duration of all the events we've seen.
 */
private long totalEventDurationNs;
```
Member Author

@mumrah mumrah Dec 23, 2024


Recording this is interesting. Since we have a single controller thread and can see its total event duration over a fixed period, we could calculate the idle percentage of the controller. Might be interesting to look at as a high level "busy" metric
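Since the controller processes events on a single thread, the idle percentage mentioned here would be a simple back-of-the-envelope calculation over the reporting period. A hypothetical helper (not part of this PR) might look like:

```java
public class ControllerIdleEstimate {
    /**
     * Approximate idle percentage of a single-threaded event loop: the summed
     * event durations are the busy time within the fixed reporting period.
     * Clamped so that measurement jitter cannot produce a negative result.
     */
    static double idlePercent(long totalEventDurationNs, long periodNs) {
        double busyFraction = Math.min(1.0, (double) totalEventDurationNs / periodNs);
        return 100.0 * (1.0 - busyFraction);
    }
}
```

As the reply below notes, error accumulates over time with this approach, so reading the wait time directly from the queue would likely be more accurate.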

Contributor


Yeah, that might be interesting. It would probably be better to get idle numbers directly from the queue somehow (since it knows how long it waited...) just to avoid errors adding up over time.

```java
/**
 * The always-log threshold in nanoseconds.
 */
private long alwaysLogThresholdNs;
```
Member Author


One reason I didn't go with a fixed threshold originally was that I was concerned about a negative feedback loop with a congested controller. For example, if all the events on the controller are above this threshold due to some bug or external influence, doing the extra logging just makes things worse.

However, thinking about it more, maybe the extra logging wouldn't be such a hit. If the controller is congested and the threshold is a few seconds, then the extra few ms to do logging wouldn't change the situation much.

Contributor


Right. Once you're at 2 seconds per event, another log message isn't going to change much. Also, I hope we're logging in less than a few ms, although with these logging libraries, ya never know...

```java
 * @param durationNs The duration in nanoseconds.
 * @return The decimal duration in milliseconds.
 */
static String nanosecondsToDecimalMillis(long durationNs) {
```
Member Author


naming: maybe "formatNanosAsMillis" or something with "format" in the name?
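A minimal version of such a formatter, using the suggested name (which is only the naming proposal from this comment, not necessarily what was merged), might be:

```java
import java.util.Locale;

public class DurationFormat {
    /**
     * Formats a nanosecond duration as decimal milliseconds with two decimal
     * places, e.g. 41_900_000 ns -> "41.90", matching the periodic log line.
     */
    static String formatNanosAsMillis(long durationNs) {
        return String.format(Locale.ROOT, "%.2f", durationNs / 1_000_000.0);
    }
}
```

Using Locale.ROOT keeps the decimal separator a "." regardless of the JVM's default locale, which matters for log lines that tools may parse.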

@cmccabe cmccabe merged commit c4840f5 into apache:trunk Jan 6, 2025
ijuma added a commit to ijuma/kafka that referenced this pull request Jan 8, 2025
…og-compaction-write-record-v2

* apache-github/trunk: (34 commits)
  MINOR: Bump year to 2025 in NOTICE file (apache#18427)
  KAFKA-18411 Remove ZkProducerIdManager (apache#18413)
  KAFKA-18408 tweak the 'tag' field for BrokerHeartbeatRequest.json, BrokerRegistrationChangeRecord.json and RegisterBrokerRecord.json (apache#18421)
  KAFKA-18414 Remove KRaftRegistrationResult (apache#18401)
  KAFKA-17921 Support SASL_PLAINTEXT protocol with java.security.auth.login.config (apache#17671)
  KAFKA-18384 Remove ZkAlterPartitionManager (apache#18364)
  KAFKA-10790: Add deadlock detection to producer#flush (apache#17946)
  KAFKA-18412: Remove EmbeddedZookeeper (apache#18399)
  MINOR : Improve Exception log in NotEnoughReplicasException(apache#12394)
  MINOR: Improve PlaintextAdminIntegrationTest#testConsumerGroups (apache#18409)
  MINOR: Remove unused local variable (apache#18410)
  MINOR: Remove RaftManager.maybeDeleteMetadataLogDir and AutoTopicCreationManagerTest.scala (apache#17365)
  KAFKA-18368 Remove TestUtils#MockZkConnect and remove zkConnect from TestUtils#createBrokerConfig (apache#18352)
  MINOR: Update Consumer group timeout default to 30 sec (apache#16406)
  MINOR: Fix typo in CommitRequestManager (apache#18407)
  MINOR: cleanup JavaDocs for deprecation warnings (apache#18402)
  KAFKA-18303; Update ShareCoordinator to use new record format (apache#18396)
  MINOR: Update Consumer and Producer JavaDocs for committing offsets (apache#18336)
  KAFKA-16446: Improve controller event duration logging (apache#15622)
  KAFKA-18388 test-kraft-server-start.sh should use log4j2.yaml (apache#18370)
  ...
cmccabe pushed a commit that referenced this pull request Feb 4, 2025
Reviewers: Colin P. McCabe <cmccabe@apache.org>
manoj-mathivanan pushed a commit to manoj-mathivanan/kafka that referenced this pull request Feb 19, 2025