
[KAFKA-6608] Add timeout parameter to methods which retrieves offsets#5014

Merged
hachikuji merged 18 commits into apache:trunk from ConcurrencyPractitioner:KAFKA-6608
May 30, 2018
Conversation

@ConcurrencyPractitioner
Contributor

@ConcurrencyPractitioner ConcurrencyPractitioner commented May 12, 2018

Currently, this PR is based on what was agreed upon in KIP-266. For further information, please see:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75974886

@ConcurrencyPractitioner
Contributor Author

@hachikuji Do you have any comments?

@ijuma ijuma requested a review from hachikuji May 21, 2018 18:57
@ijuma ijuma added this to the 2.0.0 milestone May 21, 2018
@hachikuji
Contributor

@ConcurrencyPractitioner Sorry for the delay. Can you rebase please? I have merged @vvcephei's patch which includes adding timeout behavior for fetching committed offsets.

@hachikuji
Contributor

Also, do you plan to implement the rest of the timeout APIs or shall I create a separate JIRA?

@ConcurrencyPractitioner
Contributor Author

The test failures appear to be unrelated. The JDK 10 and JDK 8 builds each have one failing test, but they are different tests, so they are probably just flaky.

@ConcurrencyPractitioner
Contributor Author

Hi @hachikuji, I have covered the remaining methods in KIP-266. Some of the methods I added, however, do not yet throw TimeoutException as their descriptions imply; I will get around to having those methods throw TimeoutException when the timeout is exceeded. Other than that, please review. Thanks.

Contributor

@hachikuji hachikuji left a comment

Thanks, left a few comments.

this.assignors = assignors;
}

public long requestTimeoutMs() {
Contributor

We shouldn't be adding unneeded public APIs.

/**
* @see KafkaConsumer#commitSync(Map)
*/
@Deprecated
Contributor

According to the KIP, the only methods to be deprecated are close(TimeUnit, long) and poll(long).

/**
* @see KafkaConsumer#commitSync(Map, Duration
*/
public void commitSync(final Map<TopicPartition, OffsetAndMetadata> offsets,
Contributor

nit: the public is redundant for an interface. Same for the other methods below.

acquireAndEnsureOpen();
try {
long currMillis = time.milliseconds();
Map<String, List<PartitionInfo>> topicMetadata = fetcher.getAllTopicMetadata(timeout.toMillis());
Contributor

getAllTopicMetadata already raises TimeoutException, so I don't think we need any of this logic.

public Map<TopicPartition, OffsetAndTimestamp> offsetsForTimes(Map<TopicPartition, Long> timestampsToSearch, Duration timeout) {
acquireAndEnsureOpen();
try {
for (Map.Entry<TopicPartition, Long> entry : timestampsToSearch.entrySet()) {
Contributor

This is the same logic that we have in offsetsForTimes(Map). We can avoid the duplication by having that method call this one. Similarly for the other APIs.
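The deduplication suggested here can be sketched as follows. This is a hypothetical stand-in, not the real KafkaConsumer internals: the method names mirror the PR, but the types and the 30-second default are illustrative assumptions.

```java
import java.time.Duration;
import java.util.Map;

// Legacy overload delegates to the new Duration-based overload,
// so the lookup logic lives in exactly one place.
class OffsetsForTimesSketch {
    static final long REQUEST_TIMEOUT_MS = 30_000L; // illustrative default

    // New API: validation, ListOffsets requests, and timeout handling
    // would all happen here.
    static Map<String, Long> offsetsForTimes(Map<String, Long> timestampsToSearch,
                                             Duration timeout) {
        if (timeout.isNegative())
            throw new IllegalArgumentException("timeout must not be negative");
        return timestampsToSearch; // placeholder for the real lookup
    }

    // Old API: a one-line delegation instead of duplicated logic.
    static Map<String, Long> offsetsForTimes(Map<String, Long> timestampsToSearch) {
        return offsetsForTimes(timestampsToSearch, Duration.ofMillis(REQUEST_TIMEOUT_MS));
    }
}
```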

* defined
* @return true if all assigned positions have a position
*/
private boolean updateFetchPositions(long start, long timeoutMs) {
Contributor

Can you explain why we cannot use the other updateFetchPositions?

Contributor Author

This was before John changed the old updateFetchPositions to take a timeout argument. Before that, I had to improvise and thought this was needed. However, this code is now irrelevant, so I will just switch over.

* @param timeout The maximum allowable duration of the method
* @throws TimeoutException if committed offsets cannot be retrieved within set amount of time
*/
public void refreshCommittedOffsetsIfNeeded(long startMs, long timeout) {
Contributor

Similar to the previous comment. Why do we need new methods that duplicate all this logic?


assertEquals(539L, consumer.position(tp0));

consumer.poll(0);
Contributor

This change seems not needed.

@@ -1,4 +1,5 @@
/**

Contributor

nit: please remove this

* since how much time one would need to block for Kafka Streams is
* still unknown
*/
private static final int DEFAULT_BLOCKING_TIME = 20000;
Contributor

I'm not sure I understand why this should be necessary. We're not removing any functionality from the consumer. If possible, I'd prefer to do changes for streams in a separate PR.

Contributor Author

To tell the truth, this was never meant to be permanent. I did this simply as a temporary marker to allow me to pass the tests using KafkaStreams. I thought that in a future PR we would remove this and come up with some other apparatus to replace what we have right now.

Contributor

Can we leave this for a follow-up? I still don't understand why we should need to touch anything in streams.

Contributor

@ConcurrencyPractitioner Can you respond here? I would prefer to keep the current streams implementation and leave improvements to a follow-up PR.

Contributor

How about retrieving StreamsConfig.REQUEST_TIMEOUT_MS_CONFIG and using that as the timeout?

Contributor

Why don't we just leave the current blocking implementation and fix this in a separate PR? Why does it need to be part of this PR?

Contributor

Unless there is a good reason why these changes must be here, please revert them. This will block merging the PR.

Contributor

See the comment in the change for StreamThread.java

Construction of StoreChangelogReader in StreamThread needs a timeout parameter.

        final StoreChangelogReader changelogReader = new StoreChangelogReader(restoreConsumer, userStateRestoreListener, logContext);

I assume you suggest using Long.MAX_VALUE for that parameter.

Contributor

I am suggesting leaving the code in as it is prior to this patch. We do not need to make the change in StoreChangelogReader to use the timeout in this patch. The old position() has not been deprecated and will continue to work as it has. In a follow-up, we can try to improve the behavior.

Contributor

Yeah, I agree. Streams code shouldn't show up in this PR diff at all. If there are failing streams tests, then I think there is something wrong with the KafkaConsumer implementation.

Contributor

@hachikuji hachikuji left a comment

Thanks for the updates. Left a few more comments.

* @see KafkaConsumer#commitSync(Map, Duration
*/
void commitSync(final Map<TopicPartition, OffsetAndMetadata> offsets,
final Duration duration);
Contributor

nit: misaligned

* @throws org.apache.kafka.common.KafkaException for any other unrecoverable errors
*/
public long position(TopicPartition partition) {
return position(partition, Duration.ofMillis(requestTimeoutMs));
Contributor

We should preserve the current behavior, which is to block indefinitely for this method.

Long offset = this.subscriptions.position(partition);
while (offset == null) {
final long startMs = time.milliseconds();
long finishMs = time.milliseconds();
Contributor

nit: initialize to startMs?

Long.MAX_VALUE
);
}
Map<TopicPartition, OffsetAndMetadata> offsets = coordinator.fetchCommittedOffsets(Collections.singleton(partition),
Contributor

The second argument to fetchCommittedOffsets is the timeout, not the current time. Also, when this method times out, it will return null. So we should check for that and raise TimeoutException.
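The fix asked for here can be sketched with stand-in types (the Coordinator interface and the String-keyed map below are illustrative assumptions, not the real ConsumerCoordinator API): pass the timeout, not the current time, and translate a null result into a TimeoutException rather than dereferencing it.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

class CommittedSketch {
    static class TimeoutException extends RuntimeException {
        TimeoutException(String message) { super(message); }
    }

    interface Coordinator {
        // Returns null if the offsets could not be fetched within timeoutMs.
        Map<String, Long> fetchCommittedOffsets(Set<String> partitions, long timeoutMs);
    }

    static Long committed(Coordinator coordinator, String partition, long timeoutMs) {
        Map<String, Long> offsets =
            coordinator.fetchCommittedOffsets(Collections.singleton(partition), timeoutMs);
        if (offsets == null)
            throw new TimeoutException("Could not fetch committed offset for " + partition
                    + " within " + timeoutMs + "ms");
        return offsets.get(partition);
    }
}
```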

Map<String, List<PartitionInfo>> topicMetadata = fetcher.getTopicMetadata(
new MetadataRequest.Builder(Collections.singletonList(topic), true), requestTimeoutMs);
new MetadataRequest.Builder(Collections.singletonList(topic), true), timeoutMs);
if (topicMetadata.isEmpty()) {
Contributor

This check is unnecessary since getTopicMetadata will raise TimeoutException if it times out. I realize there is a little inconsistency between some of these internal APIs. It is an area for improvement. The reason it is this way is that timeouts in poll() are "normal" and do not cause an exception.

consumer.poll(Duration.ZERO);

assertEquals(539L, consumer.position(tp0));
assertEquals(539L, consumer.position(tp0, Duration.ofSeconds(2)));
Contributor

Are these changes needed?

}

@Test
@Test(expected = classOf[org.apache.kafka.common.errors.TimeoutException])
Contributor

Why are we changing the behavior of these tests? They are intended to verify ACL behavior. We should just keep the current implementation.

}

@Test
@Test(expected = classOf[org.apache.kafka.common.errors.TimeoutException])
Contributor

Same comment. If a test case did not previously timeout, we shouldn't change it.

@@ -1,4 +1,5 @@
/**

Contributor

nit: please remove

* since how much time one would need to block for Kafka Streams is
* still unknown
*/
private static final int DEFAULT_BLOCKING_TIME = 20000;
Contributor

Can we leave this for a follow-up? I still don't understand why we should need to touch anything in streams.

Contributor

@hachikuji hachikuji left a comment

A few more comments.

void commitSync(Map<TopicPartition, OffsetAndMetadata> offsets);

/**
* @see KafkaConsumer#commitSync(Map, Duration
Contributor

nit: missing end parenthesis

Map<TopicPartition, OffsetAndTimestamp> offsetsForTimes(Map<TopicPartition, Long> timestampsToSearch);

/**
* @see KafkaConsumer#offsetsForTimes(java.util.Map, Duration)
Contributor

nit: we can drop the java.util. prefix on all of these methods.

*/
@Override
public void commitSync(final Map<TopicPartition, OffsetAndMetadata> offsets) {
commitSync(offsets, Duration.ofMillis(requestTimeoutMs));
Contributor

We should use Long.MAX_VALUE to keep the current behavior.
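The suggested change can be sketched as follows; names are illustrative stand-ins, and lastTimeout exists only so the delegation is observable. The point is that the legacy commitSync(Map) keeps its historical block-indefinitely semantics by delegating with an effectively infinite timeout rather than silently switching to requestTimeoutMs.

```java
import java.time.Duration;
import java.util.Map;

class CommitSyncSketch {
    static Duration lastTimeout; // captured for illustration only

    static void commitSync(Map<String, Long> offsets, Duration timeout) {
        lastTimeout = timeout;
        // ... send OffsetCommit requests, retrying until success or timeout ...
    }

    static void commitSync(Map<String, Long> offsets) {
        // Preserve the pre-KIP-266 behavior: wait as long as it takes.
        commitSync(offsets, Duration.ofMillis(Long.MAX_VALUE));
    }
}
```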

throw new IllegalArgumentException("You can only check the position for partitions assigned to this consumer.");
Long offset = this.subscriptions.position(partition);
while (offset == null) {
updateFetchPositions(10);
Contributor

Why don't we just call position(partition, Duration.ofMillis(Long.MAX_VALUE))?

Contributor

@vvcephei vvcephei May 29, 2018

Not sure how important this is, but Long.MAX_VALUE isn't exactly the current behavior. It would only make a difference if someone actually waited roughly 292 million years for the call to complete. This probably doesn't happen, but in my PR I opted to keep the exact current behavior of blocking forever. Please take this as a general remark.

That said, I'm not sure you meant to drop from Long.MAX_VALUE and retryBackoffMs in the old code to 10 ms here.

*/
@Override
public OffsetAndMetadata committed(TopicPartition partition) {
return committed(partition, Duration.ofMillis(requestTimeoutMs));
Contributor

We should use Long.MAX_VALUE to keep the current behavior.

// batch update fetch positions for any partitions without a valid position
while (!updateFetchPositions(Long.MAX_VALUE)) {
log.warn("Still updating fetch positions");
updateFetchPositions(timeout - (time.milliseconds() - startMs));
Contributor

Can we use remainingTimeAtLeastZero? Also, if updateFetchPositions returns false, we can just break.

finishMs = time.milliseconds();
final long remainingTime = Math.max(0, timeout - (finishMs - startMs));

if (remainingTime > 0) {
Contributor

We can probably skip this check and just do the following:

client.poll(remainingTimeAtLeastZero(timeout, finishMs - startMs));
offset = this.subscriptions.position(partition);
finishMs = time.milliseconds();
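The helper the reviewer refers to can be reconstructed as a sketch; the name remainingTimeAtLeastZero matches the suggestion above, but the signature is an assumption. It simply clamps the leftover time budget at zero so the poll call never receives a negative timeout.

```java
class RemainingTimeSketch {
    // Leftover budget after elapsedMs have been spent, never negative.
    static long remainingTimeAtLeastZero(long timeoutMs, long elapsedMs) {
        return Math.max(0L, timeoutMs - elapsedMs);
    }
}
```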

* @throws org.apache.kafka.common.errors.AuthenticationException if authentication fails. See the exception for more details
* @throws org.apache.kafka.common.errors.AuthorizationException if not authorized to the topic(s). See the exception for more details
* @throws IllegalArgumentException if the target timestamp is negative
* @throws org.apache.kafka.common.errors.TimeoutException if the offset metadata could not be fetched before
Contributor

This message is not accurate for this method. The other methods need similar changes.

Contributor Author

I will probably remove these, particularly since we want to preserve the behavior of the old methods (e.g. blocking infinitely). In other words, these methods I think should never throw a TimeoutException.

Contributor

That is not quite right. We do want them to throw TimeoutException. I was just pointing out that the timeout used is the one passed directly, not the request timeout.

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1));
assertEquals(1, records.count());
assertEquals(11L, consumer.position(tp0));
assertEquals(11L, consumer.position(tp0, Duration.ofSeconds(2)));
Contributor

Unless there's a good reason for these changes, can you revert them?

Contributor Author

Hi @hachikuji, I reverted all test changes.

@@ -1,4 +1,5 @@
/**

Contributor

Again, please remove this.

try {
coordinator.commitOffsetsSync(new HashMap<>(offsets), Long.MAX_VALUE);
if (!coordinator.commitOffsetsSync(new HashMap<>(offsets), totalWaitTime)) {
throw new TimeoutException("Commiting offsets synchronously took too long.");
Contributor

Nit: committing

General remark: it might be nice for debugging to include the string value of duration instead of "too long".

final long totalWaitTime = duration.toMillis();
try {
coordinator.commitOffsetsSync(new HashMap<>(offsets), Long.MAX_VALUE);
if (!coordinator.commitOffsetsSync(new HashMap<>(offsets), totalWaitTime)) {
Contributor

@vvcephei vvcephei May 29, 2018

Is this wrapping offsets to avoid mutating the input? If so, I think Collections#unmodifiableMap would be more efficient.
(edit) Oh, I see this pre-existed your change.

Contributor

So true story, we had a user who was passing a synchronized collection which ultimately caused a deadlock. We decided that it was worth the copy to keep such weird behavior out of the consumer internals.

@vvcephei
Contributor

Thanks for the patch, @ConcurrencyPractitioner.

I've left a few comments.

I think that for KIP-266, we decided to deprecate the old methods, but I didn't see any deprecation annotations, except on close. Note that they should go at a minimum on the Consumer interface, since subclasses will inherit the deprecation. I recommend also adding them to KafkaConsumer, just because we have the JavaDoc on the implementation, not the interface (I'm not sure why).
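The deprecation placement described above can be sketched with simplified stand-ins for Consumer/KafkaConsumer (names and signatures are illustrative): annotate the interface method, and repeat the annotation on the implementation so the warning and javadoc note are visible regardless of which declared type callers use.

```java
import java.time.Duration;

interface SketchConsumer {
    /** @deprecated use {@link #poll(Duration)} instead */
    @Deprecated
    String poll(long timeoutMs);

    String poll(Duration timeout);
}

class SketchKafkaConsumer implements SketchConsumer {
    // Repeated here: @Deprecated on the interface alone does not warn
    // callers who hold a SketchKafkaConsumer reference.
    @Deprecated
    @Override
    public String poll(long timeoutMs) {
        return poll(Duration.ofMillis(timeoutMs));
    }

    @Override
    public String poll(Duration timeout) {
        return "polled with " + timeout.toMillis() + "ms"; // placeholder
    }
}
```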

@ConcurrencyPractitioner
Contributor Author

Hi @hachikuji, any more comments?

@ConcurrencyPractitioner
Contributor Author

Hi @vvcephei, I'm not too sure on this point, because @hachikuji told me the deprecation tags were reserved for only close() and poll(), although that is contrary to what was agreed in the KIP.

@hachikuji
Contributor

@vvcephei Yeah, I didn't see in the KIP that the old methods were deprecated and I hadn't expected we would. Do you think there is a good reason to do so?

Contributor

@hachikuji hachikuji left a comment

Thanks @ConcurrencyPractitioner. Left a few more small comments. I think this is close.


/**
* Commit offsets returned on the last {@link #poll(Duration) poll()} for all the subscribed list of topics and partition.
* Commit offsets returned on the last {@link #poll(long) poll()} for all the subscribed list of topics and partition.
Contributor

This seems inadvertent? We want the documentation to refer to the new poll() API.

* configured groupId. See the exception for more details
* @throws org.apache.kafka.common.KafkaException for any other unrecoverable errors
*/
public long position(TopicPartition partition, final Duration duration) {
Contributor

nit: can we name the argument timeout? Users will see this name in javadocs, so we should let it be as descriptive as possible. Same for the other methods. To avoid name collisions below, you can use timeoutMs for example.

* function is called
* @throws org.apache.kafka.common.errors.InterruptException if the calling thread is interrupted before or while
* this function is called
* @throws org.apache.kafka.common.errors.TimeoutException if the method blocks for longer than requestTimoutMs
Contributor

We need to update this to refer to the timeout argument, not the request timeout.

* @throws org.apache.kafka.common.errors.AuthenticationException if authentication fails. See the exception for more details
* @throws org.apache.kafka.common.errors.AuthorizationException if not authorized to the topic(s). See the exception for more details
* @throws IllegalArgumentException if the target timestamp is negative
* @throws org.apache.kafka.common.errors.TimeoutException if the offset metadata could not be fetched before
Contributor

That is not quite right. We do want them to throw TimeoutException. I was just pointing out that the timeout used is the one passed directly, not the request timeout.

*/
@Override
public Map<TopicPartition, Long> endOffsets(Collection<TopicPartition> partitions) {
return endOffsets(partitions, Duration.ofMillis(Long.MAX_VALUE));
Contributor

There is some internal inconsistency which is probably causing some confusion. Only position(), commitSync() and committed() have the indefinite blocking behavior. The rest, including this one, should use the request timeout.
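The split described here can be summarized in a small sketch; the method names come from the discussion, but the dispatch table and the 30-second value are illustrative assumptions: position(), commitSync() and committed() keep their historical indefinite blocking, while the metadata-style calls default to request.timeout.ms.

```java
import java.time.Duration;

class DefaultTimeoutSketch {
    static final long REQUEST_TIMEOUT_MS = 30_000L; // illustrative default

    static Duration defaultTimeout(String method) {
        switch (method) {
            case "position":
            case "commitSync":
            case "committed":
                return Duration.ofMillis(Long.MAX_VALUE); // effectively indefinite
            default:
                // endOffsets, beginningOffsets, partitionsFor,
                // listTopics, offsetsForTimes, ...
                return Duration.ofMillis(REQUEST_TIMEOUT_MS);
        }
    }
}
```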


import java.nio.ByteBuffer
import java.util
import java.time.Duration
Contributor

nit: unneeded import. A couple more in ConsumerBounceTest and PlaintextConsumerTest.

@ConcurrencyPractitioner
Contributor Author

@hachikuji do you think this PR is ready?

Contributor

@hachikuji hachikuji left a comment

LGTM. Thanks for the patch. Note I pushed a few minor tweaks to help get this over the line. I will fix the conflicts when I merge.

@hachikuji
Contributor

I will merge in the morning presuming there are no problems with the build.

@hachikuji hachikuji merged commit f24a62d into apache:trunk May 30, 2018
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request Jul 6, 2018
(apache#5014)

This patch implements the consumer timeout APIs from KIP-266 (everything except `poll()`, which was done separately).

Reviewers:  John Roesler <john@confluent.io>, Jason Gustafson <jason@confluent.io>