KAFKA-7514: Add threads to ConsumeBenchWorker by stanislavkozlovski · Pull Request #5864 · apache/kafka

stanislavkozlovski · 2018-10-31T23:23:11Z

This PR adds a new ConsumeBenchSpec field - "consumerCount". "consumerCount" will be spawned over "consumerCount" threads in the ConsumeBenchWorker.
Since "consumerCount" will be 1 by default, these changes are backwards compatible

It's now questionable how existing fields such as "targetMessagesPerSec", "maxMessages", "consumerGroup" and "activeTopics" should work.

With "activeTopics", we need to decide whether they should be split over the consumers or not.
I see 4 cases which I believe we should address like this:

Random group, subscribe to topics - N unique groups all subscribed to all topics
Specifed group, subscribe to topics - 1 group subscribed to all topics. Consumers share workload.
Random groups, assign partitions - X groups all subscribed to all partitions
Specified group, assign partitions - Throw an exception. Only one consumer can read from a specific partition within the context of a consumer group. It is then unclear how and whether at all we should split the partitions across the consumers. At this phase, I believe it's best to not support this.

I believe "targetMessagesPerSec", "maxMessages" should account for each consumer individually. This would ease implementation by a ton, too.

I haven't written tests yet since I want to flesh out the design first. Any feedback is appreciated

Also call consumer#unsubscribe() and consumer#close() on finishing the task

…ilure Abort the worker if every status updater fails

Also bundle up clientId and KafkaConsumer into a single class

…private Consumer class

stanislavkozlovski · 2018-11-05T17:57:04Z

-                    ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(50));
+                while (messagesConsumed < maxMessages) {
+
+                    ConsumerRecords<byte[], byte[]> records = consumer.poll();


This raises an InterruptException and I can't figure out why. Is there something obvious that I'm missing?

[2018-11-05 17:47:05,859] WARN ConsumeRecords caught an exception: (org.apache.kafka.trogdor.workload.ConsumeBenchWorker) org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:493) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:281) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1243) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1188) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1164) at org.apache.kafka.trogdor.workload.ConsumeBenchWorker$Consumer.poll(ConsumeBenchWorker.java:503) at org.apache.kafka.trogdor.workload.ConsumeBenchWorker$ConsumeMessages.call(ConsumeBenchWorker.java:243) at org.apache.kafka.trogdor.workload.ConsumeBenchWorker$ConsumeMessages.call(ConsumeBenchWorker.java:198) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:844) Caused by: java.lang.InterruptedException ... 14 more

Most likely something somewhere else failed, and your thread was sent the interrupt exception as part of tearing down the worker.

I am not sure about the cause of this since I haven't been able to reproduce it since. I had changed the parameters in trogdor-run-consume-bench.sh. Might have been something to do with old consumer group state after having re-ran it a couple of times

stanislavkozlovski · 2018-11-05T17:57:23Z

                    startBatchMs = Time.SYSTEM.milliseconds();
                }
            } catch (Exception e) {
+                // TODO: Should we close task on consumer failure?


This is an open question - I'm not sure what's better - to abort the whole task if one consumer fails or continue running until the last one fails

Please abort the whole task if the consumer fails. That's how the rest of the tests work.

In general, the consumer should never fail-- if it does, there is a serious problem.

Makes sense. I have logic implemented that does not fail the whole task is a StatusUpdater of one consumer fails - only when all StatusUpdaters fail does the task get aborted. Thoughts on that?

The status updaters should not fail, right? Any failure should abort the worker.

Seems reasonable. I reverted my commit

cmccabe · 2018-11-05T19:26:40Z

Please let's use "thread count" instead of "consumer count." We may eventually want to extend this job to be able to spawn multiple workers (if it isn't already).

I think the questions above about how max messages per sec, etc. work are resolved, right? They apply on a per-thread basis. We should add some JavaDoc specifying this.

stanislavkozlovski · 2018-11-05T19:33:32Z

Thanks for the review @cmccabe. The questions on maxMessages and etc are resolves, yes. I have already added a JavaDoc explaining this in ConsumeBenchSpec.java - please take a look to see if they are adequate

…pdate failure" This reverts commit 63ebaa9.

…s warning)

cmccabe · 2018-11-06T16:28:12Z

                            @JsonProperty("consumerConf") Map<String, String> consumerConf,
                            @JsonProperty("commonClientConf") Map<String, String> commonClientConf,
                            @JsonProperty("adminClientConf") Map<String, String> adminClientConf,
+                            @JsonProperty("threadCount") Integer threadCount,


Can we call this "threadsPerWorker" instead? We will probably want to support multiple workers in the future.

cmccabe · 2018-11-06T16:29:22Z


+    @JsonProperty
+    public int consumerCount() {
+        return consumerCount;


We should make this consistent with the name of the varialbe in the spec

cmccabe · 2018-11-06T16:33:35Z

+ * It is worth noting that the "targetMessagesPerSec", "maxMessages" and "activeTopics" fields apply for every consumer individually.
+ * If a consumer group is not specified, every consumer is assigned a different, random group. When specified, all consumers use the same group.
+ * Specifying partitions, a consumer group and multiple consumers will result in an #{@link ConfigException} and the task will abort.
+ *


It would probably be good to have a separate paragraph to describe the group configuration.

We probably want to mention that if individual partitions are specified, only a single consumer is supported, and subscribe is not used.

cmccabe · 2018-11-06T16:35:41Z

-        this.status = status;
+            spec.consumerCount() + 2, // 1 thread for all the ConsumeStatusUpdater and 1 for the StatusUpdater
+            ThreadUtils.createThreadFactory("ConsumeBenchWorkerThread%d", false));
+        this.statusUpdaterFuture = executor.scheduleAtFixedRate(this.statusUpdater, 2, 1, TimeUnit.MINUTES);


Hmm. Why change the initial delay from 1 minute to 2?

Whoops, I initially thought this was seconds. My reasoning was that since there are more threads spawned and a rebalance happening, it would make sense to wait a bit more before exposing the status.
Anyway, reverting this back to 1 minute

cmccabe · 2018-11-06T16:39:11Z

+            boolean toUseGroupPartitionAssignment = partitionsByTopic.values().stream().allMatch(List::isEmpty);
+
+            if (!toUseGroupPartitionAssignment && !toUseRandomConsumeGroup() && consumerCount > 1)
+                throw new ConfigException(String.format("Will not split partitions across consumers from the %s group", consumerGroup));


This exception seems a bit confusing.

Maybe something like "You may not specify an explicit partition assignment when using multiple consumers in the same group. Please either leave the consumer group unset, or specify topics instead of partitions."

cmccabe · 2018-11-06T16:40:43Z

+
+            String clientId = clientId(0);
+            consumer = consumer(consumerGroup, clientId);
+            if (!toUseGroupPartitionAssignment) {


Since we're handling both the case where this is true, and where it is false, we might as well test for (toUseGroupPartitionAssignment) rather than (!toUseGroupPartitionAssignment)

True, that's more readable

cmccabe · 2018-11-06T16:42:02Z

+         */
+        private ThreadSafeConsumer consumer(String consumerGroup, String idx) {
            Properties props = new Properties();
+            String clientId = String.format("consumer.%s-%s", id, idx);


This seems to duplicate the code in the clientId(int idx) function?

cmccabe · 2018-11-06T16:44:00Z

-        private String generateConsumerGroup() {
-            return "consume-bench-" + UUID.randomUUID().toString();
+        private boolean toUseRandomConsumeGroup() {
+            return spec.consumerGroup().equals(ConsumeBenchSpec.EMPTY_CONSUMER_GROUP);


You should be able to just say spec.consumerGroup().isEmpty()

cmccabe · 2018-11-06T16:44:56Z

-            }
+                new ConsumeStatusUpdater(latencyHistogram, messageSizeHistogram, consumer), 1, 1, TimeUnit.MINUTES);
+            int perPeriod;
+            if (spec.targetMessagesPerSec() == 0)


Suggest using <= 0

cmccabe · 2018-11-06T16:45:50Z

                        latencyHistogram.add(elapsedBatchMs);
                        messageSizeHistogram.add(messageBytes);
                        bytesConsumed += messageBytes;
+                        if (messagesConsumed >= maxMessages)


Why is this needed? The loop condition is the same.

The default max.poll.records is 500. This would result in the consumer always consuming 500 at a minimum (provided the topic had the records) and not respect the cases where maxMessages was less than that

cmccabe · 2018-11-06T16:49:51Z

+    /**
+     * A thread-safe KafkaConsumer wrapper
+     */
+    private static class ThreadSafeConsumer {


This is not needed. Only one thread at a time should be accessing the consumer.

The ConsumeStatusUpdater calls KafkaConsumer#assignment() dynamically to get the latest assignments for a consumer.
My reasoning was that there might be edge cases where these assignments change during a benchmark run (e.g one consumer finishes early, starts late, etc).
That might not be needed in practice, though. WDYT?

cmccabe · 2018-11-06T16:50:17Z

+        ConsumerRecords<byte[], byte[]> poll() {
+            this.consumerLock.lock();
+            try {
+                return consumer.poll(Duration.ofMillis(50));


50 ms is way too short for a poll interval. We don't want to use up so much CPU. If there's nothing to consume it should wait for at least a minute.

That's how it before but good point. If we remove the lock we should go with this

Better use of clientId() method Rename threadCount to threadsPerWorker Improve ConsumeBenchSpec javadoc

cmccabe · 2018-11-09T18:35:13Z

    def __init__(self, start_ms, duration_ms, consumer_node, bootstrap_servers,
                 target_messages_per_sec, max_messages, active_topics,
-                 consumer_conf, common_client_conf, admin_client_conf, consumer_group=None):
+                 consumer_conf, common_client_conf, admin_client_conf, consumer_group=None, consumer_count=1):


Should be "threadsPerWorker"

cmccabe · 2018-11-09T18:36:16Z

                                                consumer_conf={},
                                                admin_client_conf={},
                                                common_client_conf={},
+                                                consumer_count=2,


Same (should be threads_per_worker)

cmccabe · 2018-11-09T18:39:39Z

Thanks @stanislavkozlovski . Looks good. Should be ready to commit once the variable name is fixed in the python code.

stanislavkozlovski · 2018-11-10T08:57:17Z

@cmccabe the JDK 8 tests passed and the JDK11 failures look like unrelated flaky tests

Add threads with separate consumers to ConsumeBenchWorker. Update the Trogdor test scripts and documentation with the new functionality. Reviewers: Colin McCabe <cmccabe@apache.org>

stanislavkozlovski added 11 commits November 1, 2018 01:22

KAFKA-7514: Add threads to ConsumeBenchWorker

a322919

Add status updaters for the separate consumers

956231a

Ensure ConsumeBench does not consume more than max messages

4acb629

Also call consumer#unsubscribe() and consumer#close() on finishing the task

Do not close ConsumeBenchWorker on a single consumer status update fa…

63ebaa9

…ilure Abort the worker if every status updater fails

Add assignedPartitions field to consumer status

7a11a49

Also bundle up clientId and KafkaConsumer into a single class

Ensure thread-safe access to KafkaConsumer by wrapping operations in …

82383ad

…private Consumer class

Make Consumer#close() idempotent

9e2318d

Update tests

e0ab43d

Update Trogdor test scripts to use new functionality

72c9bf6

Update documentation

5885215

Handle undefined targetMessagesPerSec in ConsumeBenchSpec

86b7e19

stanislavkozlovski commented Nov 5, 2018

View reviewed changes

stanislavkozlovski added 4 commits November 5, 2018 19:34

Rename consumeCount->threadCount

00aaa55

Stylistic changes

4feca2c

Revert "Do not close ConsumeBenchWorker on a single consumer status u…

32cc8f8

…pdate failure" This reverts commit 63ebaa9.

Rename inner Consumer class and change to static inner class (findbug…

f51041e

…s warning)

cmccabe reviewed Nov 6, 2018

View reviewed changes

Small changes

b0cae79

Better use of clientId() method Rename threadCount to threadsPerWorker Improve ConsumeBenchSpec javadoc

cmccabe reviewed Nov 9, 2018

View reviewed changes

Rename consumer_count to threads_per_worker

119fec4

cmccabe merged commit 8259fda into apache:trunk Nov 13, 2018

Conversation

stanislavkozlovski commented Oct 31, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cmccabe commented Nov 5, 2018

Uh oh!

stanislavkozlovski commented Nov 5, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cmccabe commented Nov 9, 2018

Uh oh!

stanislavkozlovski commented Nov 10, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stanislavkozlovski commented Oct 31, 2018 •

edited

Loading