KAFKA-7514: Add threads to ConsumeBenchWorker#5864
KAFKA-7514: Add threads to ConsumeBenchWorker#5864cmccabe merged 17 commits intoapache:trunkfrom stanislavkozlovski:KAFKA-7514
Conversation
Also call consumer#unsubscribe() and consumer#close() on finishing the task
…ilure Abort the worker if every status updater fails
Also bundle up clientId and KafkaConsumer into a single class
…private Consumer class
| ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(50)); | ||
| while (messagesConsumed < maxMessages) { | ||
|
|
||
| ConsumerRecords<byte[], byte[]> records = consumer.poll(); |
There was a problem hiding this comment.
This raises an InterruptException and I can't figure out why. Is there something obvious that I'm missing?
[2018-11-05 17:47:05,859] WARN ConsumeRecords caught an exception: (org.apache.kafka.trogdor.workload.ConsumeBenchWorker)
org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:493)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:281)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1243)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1188)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1164)
at org.apache.kafka.trogdor.workload.ConsumeBenchWorker$Consumer.poll(ConsumeBenchWorker.java:503)
at org.apache.kafka.trogdor.workload.ConsumeBenchWorker$ConsumeMessages.call(ConsumeBenchWorker.java:243)
at org.apache.kafka.trogdor.workload.ConsumeBenchWorker$ConsumeMessages.call(ConsumeBenchWorker.java:198)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: java.lang.InterruptedException
... 14 more
There was a problem hiding this comment.
Most likely something somewhere else failed, and your thread was sent the interrupt exception as part of tearing down the worker.
There was a problem hiding this comment.
I am not sure about the cause of this since I haven't been able to reproduce it since. I had changed the parameters in trogdor-run-consume-bench.sh. Might have been something to do with old consumer group state after having re-ran it a couple of times
| startBatchMs = Time.SYSTEM.milliseconds(); | ||
| } | ||
| } catch (Exception e) { | ||
| // TODO: Should we close task on consumer failure? |
There was a problem hiding this comment.
This is an open question - I'm not sure what's better - to abort the whole task if one consumer fails or continue running until the last one fails
There was a problem hiding this comment.
Please abort the whole task if the consumer fails. That's how the rest of the tests work.
In general, the consumer should never fail-- if it does, there is a serious problem.
There was a problem hiding this comment.
Makes sense. I have logic implemented that does not fail the whole task is a StatusUpdater of one consumer fails - only when all StatusUpdaters fail does the task get aborted. Thoughts on that?
There was a problem hiding this comment.
The status updaters should not fail, right? Any failure should abort the worker.
There was a problem hiding this comment.
Seems reasonable. I reverted my commit
|
Please let's use "thread count" instead of "consumer count." We may eventually want to extend this job to be able to spawn multiple workers (if it isn't already). I think the questions above about how max messages per sec, etc. work are resolved, right? They apply on a per-thread basis. We should add some JavaDoc specifying this. |
|
Thanks for the review @cmccabe. The questions on maxMessages and etc are resolves, yes. I have already added a JavaDoc explaining this in |
…pdate failure" This reverts commit 63ebaa9.
| @JsonProperty("consumerConf") Map<String, String> consumerConf, | ||
| @JsonProperty("commonClientConf") Map<String, String> commonClientConf, | ||
| @JsonProperty("adminClientConf") Map<String, String> adminClientConf, | ||
| @JsonProperty("threadCount") Integer threadCount, |
There was a problem hiding this comment.
Can we call this "threadsPerWorker" instead? We will probably want to support multiple workers in the future.
|
|
||
| @JsonProperty | ||
| public int consumerCount() { | ||
| return consumerCount; |
There was a problem hiding this comment.
We should make this consistent with the name of the varialbe in the spec
| * It is worth noting that the "targetMessagesPerSec", "maxMessages" and "activeTopics" fields apply for every consumer individually. | ||
| * If a consumer group is not specified, every consumer is assigned a different, random group. When specified, all consumers use the same group. | ||
| * Specifying partitions, a consumer group and multiple consumers will result in an #{@link ConfigException} and the task will abort. | ||
| * |
There was a problem hiding this comment.
It would probably be good to have a separate paragraph to describe the group configuration.
We probably want to mention that if individual partitions are specified, only a single consumer is supported, and subscribe is not used.
| this.status = status; | ||
| spec.consumerCount() + 2, // 1 thread for all the ConsumeStatusUpdater and 1 for the StatusUpdater | ||
| ThreadUtils.createThreadFactory("ConsumeBenchWorkerThread%d", false)); | ||
| this.statusUpdaterFuture = executor.scheduleAtFixedRate(this.statusUpdater, 2, 1, TimeUnit.MINUTES); |
There was a problem hiding this comment.
Hmm. Why change the initial delay from 1 minute to 2?
There was a problem hiding this comment.
Whoops, I initially thought this was seconds. My reasoning was that since there are more threads spawned and a rebalance happening, it would make sense to wait a bit more before exposing the status.
Anyway, reverting this back to 1 minute
| boolean toUseGroupPartitionAssignment = partitionsByTopic.values().stream().allMatch(List::isEmpty); | ||
|
|
||
| if (!toUseGroupPartitionAssignment && !toUseRandomConsumeGroup() && consumerCount > 1) | ||
| throw new ConfigException(String.format("Will not split partitions across consumers from the %s group", consumerGroup)); |
There was a problem hiding this comment.
This exception seems a bit confusing.
Maybe something like "You may not specify an explicit partition assignment when using multiple consumers in the same group. Please either leave the consumer group unset, or specify topics instead of partitions."
|
|
||
| String clientId = clientId(0); | ||
| consumer = consumer(consumerGroup, clientId); | ||
| if (!toUseGroupPartitionAssignment) { |
There was a problem hiding this comment.
Since we're handling both the case where this is true, and where it is false, we might as well test for (toUseGroupPartitionAssignment) rather than (!toUseGroupPartitionAssignment)
There was a problem hiding this comment.
True, that's more readable
| */ | ||
| private ThreadSafeConsumer consumer(String consumerGroup, String idx) { | ||
| Properties props = new Properties(); | ||
| String clientId = String.format("consumer.%s-%s", id, idx); |
There was a problem hiding this comment.
This seems to duplicate the code in the clientId(int idx) function?
| private String generateConsumerGroup() { | ||
| return "consume-bench-" + UUID.randomUUID().toString(); | ||
| private boolean toUseRandomConsumeGroup() { | ||
| return spec.consumerGroup().equals(ConsumeBenchSpec.EMPTY_CONSUMER_GROUP); |
There was a problem hiding this comment.
You should be able to just say spec.consumerGroup().isEmpty()
| } | ||
| new ConsumeStatusUpdater(latencyHistogram, messageSizeHistogram, consumer), 1, 1, TimeUnit.MINUTES); | ||
| int perPeriod; | ||
| if (spec.targetMessagesPerSec() == 0) |
| latencyHistogram.add(elapsedBatchMs); | ||
| messageSizeHistogram.add(messageBytes); | ||
| bytesConsumed += messageBytes; | ||
| if (messagesConsumed >= maxMessages) |
There was a problem hiding this comment.
Why is this needed? The loop condition is the same.
There was a problem hiding this comment.
The default max.poll.records is 500. This would result in the consumer always consuming 500 at a minimum (provided the topic had the records) and not respect the cases where maxMessages was less than that
| /** | ||
| * A thread-safe KafkaConsumer wrapper | ||
| */ | ||
| private static class ThreadSafeConsumer { |
There was a problem hiding this comment.
This is not needed. Only one thread at a time should be accessing the consumer.
There was a problem hiding this comment.
The ConsumeStatusUpdater calls KafkaConsumer#assignment() dynamically to get the latest assignments for a consumer.
My reasoning was that there might be edge cases where these assignments change during a benchmark run (e.g one consumer finishes early, starts late, etc).
That might not be needed in practice, though. WDYT?
| ConsumerRecords<byte[], byte[]> poll() { | ||
| this.consumerLock.lock(); | ||
| try { | ||
| return consumer.poll(Duration.ofMillis(50)); |
There was a problem hiding this comment.
50 ms is way too short for a poll interval. We don't want to use up so much CPU. If there's nothing to consume it should wait for at least a minute.
There was a problem hiding this comment.
That's how it before but good point. If we remove the lock we should go with this
Better use of clientId() method Rename threadCount to threadsPerWorker Improve ConsumeBenchSpec javadoc
| def __init__(self, start_ms, duration_ms, consumer_node, bootstrap_servers, | ||
| target_messages_per_sec, max_messages, active_topics, | ||
| consumer_conf, common_client_conf, admin_client_conf, consumer_group=None): | ||
| consumer_conf, common_client_conf, admin_client_conf, consumer_group=None, consumer_count=1): |
There was a problem hiding this comment.
Should be "threadsPerWorker"
| consumer_conf={}, | ||
| admin_client_conf={}, | ||
| common_client_conf={}, | ||
| consumer_count=2, |
There was a problem hiding this comment.
Same (should be threads_per_worker)
|
Thanks @stanislavkozlovski . Looks good. Should be ready to commit once the variable name is fixed in the python code. |
|
@cmccabe the JDK 8 tests passed and the JDK11 failures look like unrelated flaky tests |
Add threads with separate consumers to ConsumeBenchWorker. Update the Trogdor test scripts and documentation with the new functionality. Reviewers: Colin McCabe <cmccabe@apache.org>
This PR adds a new ConsumeBenchSpec field - "consumerCount". "consumerCount" will be spawned over "consumerCount" threads in the ConsumeBenchWorker.
Since "consumerCount" will be 1 by default, these changes are backwards compatible
It's now questionable how existing fields such as "targetMessagesPerSec", "maxMessages", "consumerGroup" and "activeTopics" should work.
With "activeTopics", we need to decide whether they should be split over the consumers or not.
I see 4 cases which I believe we should address like this:
I believe "targetMessagesPerSec", "maxMessages" should account for each consumer individually. This would ease implementation by a ton, too.
I haven't written tests yet since I want to flesh out the design first. Any feedback is appreciated