KAFKA-16446: Improve controller event duration logging (#15622)
cmccabe merged 15 commits into apache:trunk.
Conversation
soarez left a comment:
Did you consider tagging controllerMetrics.updateEventQueueProcessingTime with the event name instead? That would seem like a more general and useful solution.
@soarez wdym by tagging here? Note that the events have relatively unique names (we include offset in some) so the cardinality is quite high.

I see. What I meant would result in too many more metrics. Makes sense then.
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
Force-pushed from fea5f7b to c20ec56.
@soarez I've brushed off this old PR. PTAL
```diff
- * The maximum records that the controller will write in a single batch.
+ * The default maximum records that the controller will write in a single batch.
   */
-private static final int MAX_RECORDS_PER_BATCH = 10000;
+private static final int DEFAULT_MAX_RECORDS_PER_BATCH = 10000;
```
Even though there are no code paths that change the value, within the scope of QuorumController.Builder this is in fact a default.
I'm not sure I understand your comment. Are you just agreeing with the change to DEFAULT_ or something else?
@soarez thanks for the review. Addressed your comments.
I was a bit confused by this PR since I thought that it was going to be about setting a threshold (like 5 seconds or something) and saying that any event that lasted longer than that was so bad, so egregious, that it should always be logged. (Tangent: I still actually do think we should do that!) But instead this is more like having a background process gathering stats. Which is actually quite useful in its own way.

Question, though: why can't we set the logging interval to 60 seconds and just log the longest event unconditionally? It's not that much noise, and people can turn it off if they want to (by setting the log4j level for this class).

There's probably a bunch of other stuff we could do, like try to get the "top 5" rather than just the top event. But we can always do that in a follow-on, no need to block this one.

I think we should rename
|
@cmccabe Thanks for taking a look!
I thought about this for a while and couldn’t come up with a good threshold. Looking at our CCloud data some clusters run at 10ms average event times so an event of 200ms would be interesting to observe. Other clusters, we are seeing average of 100ms event times, so 200ms isn’t so interesting. That’s what led me to taking a statistical approach. However, we could definitely add an “always log above this threshold” as a separate thing (with a unique log line).
We could, though that could make finding some rare event a bit more difficult. Also, if we had a burst of slow events, we would only log one instead of all that were above p99 (rare, but possible due to the histogram behavior).
Seems fine to me. Like you said, we could evolve this to capture more stuff in the future.
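The per-period bookkeeping discussed above (count, running total for the average, and the single slowest event) can be sketched roughly as follows. This is illustrative only: the class and method names are mine, and the merged EventPerformanceMonitor differs in detail.

```java
import java.util.Locale;

// Illustrative sketch of per-period event statistics, in the spirit of the
// PR's periodic log line. Not the merged Kafka code.
public class PeriodStats {
    private int eventCount = 0;
    private long totalEventDurationNs = 0;
    private long slowestEventDurationNs = 0;
    private String slowestEventName = null;

    /** Record one completed event, tracking the slowest one seen this period. */
    public void observe(String name, long durationNs) {
        eventCount++;
        totalEventDurationNs += durationNs;
        if (durationNs > slowestEventDurationNs) {
            slowestEventDurationNs = durationNs;
            slowestEventName = name;
        }
    }

    /** Build a summary like the PR's periodic message, then reset for the next period. */
    public String periodicSummary(long periodMs) {
        double avgMs = eventCount == 0 ? 0.0
            : totalEventDurationNs / 1_000_000.0 / eventCount;
        String summary = String.format(Locale.ROOT,
            "In the last %d ms period, %d events were completed, which took an "
            + "average of %.2f ms each. The slowest event was %s, which took %.2f ms.",
            periodMs, eventCount, avgMs, slowestEventName,
            slowestEventDurationNs / 1_000_000.0);
        eventCount = 0;
        totalEventDurationNs = 0;
        slowestEventDurationNs = 0;
        slowestEventName = null;
        return summary;
    }

    public static void main(String[] args) {
        PeriodStats stats = new PeriodStats();
        stats.observe("createTopics", 41_900_000L);               // 41.9 ms
        stats.observe("handleCommit[baseOffset=0]", 12_000_000L); // 12.0 ms
        System.out.println(stats.periodicSummary(60_000L));
    }
}
```

Note that, as discussed in the thread, tracking only a single maximum means a burst of slow events collapses into one log line; a "top 5" would need a small bounded heap instead of one field.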
```java
/**
 * The total duration of all the events we've seen.
 */
private long totalEventDurationNs;
```
Recording this is interesting. Since we have a single controller thread and can see its total event duration over a fixed period, we could calculate the idle percentage of the controller. Might be interesting to look at as a high level "busy" metric
Yeah, that might be interesting. It would probably be better to get idle numbers directly from the queue somehow (since it knows how long it waited...) just to avoid errors adding up over time.
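Since there is a single controller thread, the "busy" metric floated above reduces to dividing the summed event durations by the reporting period. A minimal sketch, with illustrative names not taken from the PR:

```java
// Hypothetical sketch: derive a coarse "busy" percentage for a
// single-threaded controller from the summed event durations.
public class BusyEstimator {
    private long totalEventDurationNs = 0;

    public void recordEvent(long durationNs) {
        totalEventDurationNs += durationNs;
    }

    /** Percentage of the reporting period spent processing events. */
    public double busyPercent(long periodNs) {
        if (periodNs <= 0) return 0.0;
        return 100.0 * totalEventDurationNs / periodNs;
    }

    public static void main(String[] args) {
        BusyEstimator e = new BusyEstimator();
        e.recordEvent(300_000_000L); // 300 ms
        e.recordEvent(200_000_000L); // 200 ms
        System.out.println(e.busyPercent(1_000_000_000L)); // busy % over a 1 s period
    }
}
```

As the reply notes, deriving idle time from the queue itself (which knows how long it waited) would avoid this estimate drifting as measurement errors accumulate.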
```java
/**
 * The always-log threshold in nanoseconds.
 */
private long alwaysLogThresholdNs;
```
One reason I didn't go with a fixed threshold originally was that I was concerned about a negative feedback loop with a congested controller. For example, if all the events on the controller are above this threshold due to some bug or external influence, doing the extra logging just makes things worse.
However, thinking about it more, maybe the extra logging wouldn't be such a hit. If the controller is congested and the threshold is a few seconds, then the extra few ms to do logging wouldn't change the situation much.
Right. Once you're at 2 seconds per event, another log message isn't going to change much. Also, I hope we're logging in less than a few ms, although with these logging libraries, ya never know...
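For concreteness, the guard being discussed amounts to a single comparison against the threshold; a sketch assuming the 2-second default mentioned in the PR description (names are illustrative):

```java
// Illustrative sketch of the always-log threshold check, not the merged code.
public class SlowEventCheck {
    // Assumed default from the PR description: events over 2 s log at ERROR.
    static final long ALWAYS_LOG_THRESHOLD_NS = 2_000_000_000L;

    static boolean shouldAlwaysLog(long durationNs) {
        return durationNs >= ALWAYS_LOG_THRESHOLD_NS;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlwaysLog(5_240_000_000L)); // a 5240 ms event
        System.out.println(shouldAlwaysLog(12_000_000L));    // a 12 ms event
    }
}
```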
```java
/**
 * @param durationNs The duration in nanoseconds.
 * @return The decimal duration in milliseconds.
 */
static String nanosecondsToDecimalMillis(long durationNs) {
```
naming: maybe "formatNanosAsMillis" or something with "format" in the name?
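Whatever the final name, the helper presumably boils down to fixed-point formatting of a nanosecond count as milliseconds. A minimal sketch using the reviewer's suggested name (not necessarily the merged signature or behavior):

```java
import java.util.Locale;

public class DurationFormat {
    /**
     * Sketch of the helper under discussion: format a nanosecond duration
     * as decimal milliseconds with two decimal places.
     */
    static String formatNanosAsMillis(long durationNs) {
        // Locale.ROOT avoids comma decimal separators in some locales.
        return String.format(Locale.ROOT, "%.2f", durationNs / 1_000_000.0);
    }

    public static void main(String[] args) {
        System.out.println(formatNanosAsMillis(41_900_000L) + " ms"); // 41.90 ms
    }
}
```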
There are times when the controller has a high event processing time, such as during startup, or when creating a topic with many partitions. We can see these processing times in the p99 metric (kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs), but it's difficult to see exactly which event is causing the high processing time. With DEBUG logs, we see every event along with its processing time, but even then it's tedious to find the slow event.

This PR logs all events which take longer than 2 seconds at ERROR level. This will help identify events that are taking far too long, and which could be disruptive to the operation of the controller. The slow event logging looks like this:

```
[2024-12-20 15:03:39,754] ERROR [QuorumController id=1] Exceptionally slow controller event createTopics took 5240 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
```

Also, every 60 seconds, it logs some event time statistics, including the average time, the maximum time, and the name of the event which took the longest. This periodic message looks like this:

```
[2024-12-20 15:35:04,798] INFO [QuorumController id=1] In the last 60000 ms period, 333 events were completed, which took an average of 12.34 ms each. The slowest event was handleCommit[baseOffset=0], which took 41.90 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
```

An operator can disable these logs by adding the following to their log4j config:

```
org.apache.kafka.controller.EventPerformanceMonitor=OFF
```

Reviewers: Colin P. McCabe <cmccabe@apache.org>